Posted to issues@spark.apache.org by "Sean Owen (JIRA)" <ji...@apache.org> on 2015/04/24 02:41:39 UTC
[jira] [Updated] (SPARK-6117) describe function for summary statistics
[ https://issues.apache.org/jira/browse/SPARK-6117?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sean Owen updated SPARK-6117:
-----------------------------
Assignee: Andrey Zagrebin
> describe function for summary statistics
> ----------------------------------------
>
> Key: SPARK-6117
> URL: https://issues.apache.org/jira/browse/SPARK-6117
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Reporter: Reynold Xin
> Assignee: Andrey Zagrebin
> Labels: starter
> Fix For: 1.3.1, 1.4.0
>
>
> DataFrame.describe should return a DataFrame with summary statistics.
> {code}
> def describe(cols: String*): DataFrame
> {code}
> If cols is empty, then run describe on all numeric columns.
> The returned DataFrame should have 5 rows (count, mean, stddev, min, max) and n + 1 columns. The 1st column is the name of the aggregate function, and the next n columns are the numeric columns of interest in the input DataFrame.
> This is similar to describe in pandas, but without the percentile rows, since accurate percentiles are too expensive to compute on big data:
> {code}
> In [19]: df.describe()
> Out[19]:
>               A         B         C         D
> count  6.000000  6.000000  6.000000  6.000000
> mean   0.073711 -0.431125 -0.687758 -0.233103
> std    0.843157  0.922818  0.779887  0.973118
> min   -0.861849 -2.104569 -1.509059 -1.135632
> max    1.212112  0.567020  0.276232  1.071804
> {code}
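
For illustration only, the shape described above (5 rows, n + 1 columns, defaulting to all numeric columns when no names are given) can be sketched in plain Python. This is not Spark code and not the implementation attached to this issue. The dict-of-lists input, the "summary" label for the first column, and the use of the sample standard deviation (as pandas does by default) are assumptions made for this sketch.

```python
import statistics


def is_numeric(value):
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def describe(data, *cols):
    """Summarize columns of `data`, a dict mapping column name -> list of values.

    `data` is a hypothetical stand-in for a DataFrame. Returns (header, rows):
    header has n + 1 entries and rows has 5 entries, one per aggregate.
    """
    if not cols:
        # No columns requested: fall back to every all-numeric column.
        cols = [name for name, values in data.items()
                if values and all(is_numeric(v) for v in values)]

    aggregates = {
        "count": lambda xs: float(len(xs)),
        "mean": statistics.mean,
        "stddev": statistics.stdev,  # sample stddev (assumption); needs >= 2 values
        "min": min,
        "max": max,
    }

    header = ["summary", *cols]  # "summary" is an assumed label for column 1
    rows = [[name, *(fn(data[c]) for c in cols)]
            for name, fn in aggregates.items()]
    return header, rows


if __name__ == "__main__":
    table = {"A": [1.0, 2.0, 3.0], "B": [2, 4, 6], "name": ["x", "y", "z"]}
    header, rows = describe(table)
    print(header)
    for row in rows:
        print(row)
```

Calling describe(table, "A") instead restricts the result to a single column, giving 2 columns and the same 5 rows. Non-numeric columns such as "name" are skipped only in the default case; passing one explicitly is left unhandled in this sketch.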
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)