
Posted to issues@spark.apache.org by "Sean Owen (JIRA)" <ji...@apache.org> on 2015/04/24 02:41:39 UTC

[jira] [Updated] (SPARK-6117) describe function for summary statistics

     [ https://issues.apache.org/jira/browse/SPARK-6117?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Owen updated SPARK-6117:
-----------------------------
    Assignee: Andrey Zagrebin

> describe function for summary statistics
> ----------------------------------------
>
>                 Key: SPARK-6117
>                 URL: https://issues.apache.org/jira/browse/SPARK-6117
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>            Reporter: Reynold Xin
>            Assignee: Andrey Zagrebin
>              Labels: starter
>             Fix For: 1.3.1, 1.4.0
>
>
> DataFrame.describe should return a DataFrame with summary statistics. 
> {code}
> def describe(cols: String*): DataFrame
> {code}
> If cols is empty, describe runs on all numeric columns.
> The returned DataFrame should have 5 rows (count, mean, stddev, min, max) and n + 1 columns, where n is the number of columns described. The first column holds the name of the aggregate function, and the remaining n columns hold the numeric columns of interest from the input DataFrame.
> This is similar to describe in pandas (example output below), but without the percentile rows, since accurate percentiles are too expensive to compute on large datasets.
> {code}
> In [19]: df.describe()
> Out[19]: 
>               A         B         C         D
> count  6.000000  6.000000  6.000000  6.000000
> mean   0.073711 -0.431125 -0.687758 -0.233103
> std    0.843157  0.922818  0.779887  0.973118
> min   -0.861849 -2.104569 -1.509059 -1.135632
> max    1.212112  0.567020  0.276232  1.071804
> {code}
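
The shape requested in the description (5 rows, n + 1 columns, all numeric columns by default) can be illustrated with a small, dependency-free Python sketch. This is not Spark's implementation: the list-of-dicts stand-in for a DataFrame, the "summary" column name, the helper names, and the choice of sample standard deviation (n - 1, as pandas reports) are all assumptions made for illustration; the ticket does not specify them.

```python
import math

SUMMARY_ROWS = ("count", "mean", "stddev", "min", "max")


def _is_number(value):
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def _numeric_column(rows, name):
    # A column counts as numeric if it has at least one non-null value
    # and every non-null value is a number.
    values = [r[name] for r in rows if r.get(name) is not None]
    return bool(values) and all(_is_number(v) for v in values)


def _summarize(values):
    """Return (count, mean, stddev, min, max) for a list of numbers."""
    n = len(values)
    if n == 0:
        return (0, None, None, None, None)
    mean = sum(values) / n
    # Assumption: sample standard deviation (divide by n - 1).
    stddev = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1)) if n > 1 else None
    return (n, mean, stddev, min(values), max(values))


def describe(rows, *cols):
    """Summarize the given columns; with no columns, use all numeric ones.

    rows is a list of dicts (hypothetical stand-in for a DataFrame).
    The result has 5 rows and len(cols) + 1 columns; the first column,
    named "summary" here, holds the aggregate function name.
    """
    if not cols:
        seen = []
        for row in rows:
            for name in row:
                if name not in seen:
                    seen.append(name)
        cols = [name for name in seen if _numeric_column(rows, name)]

    stats = {
        name: _summarize([r[name] for r in rows if r.get(name) is not None])
        for name in cols
    }
    result = []
    for i, label in enumerate(SUMMARY_ROWS):
        out = {"summary": label}
        for name in cols:
            out[name] = stats[name][i]
        result.append(out)
    return result


# Made-up sample data: "name" is non-numeric and is skipped by default.
sample = [
    {"name": "x", "a": 1, "b": 2.0},
    {"name": "y", "a": 2, "b": 4.0},
    {"name": "z", "a": 3, "b": 6.0},
]
for summary_row in describe(sample):
    print(summary_row)
```

Calling describe(sample, "b") restricts the summary to one column, mirroring the cols: String* parameter in the proposed signature. A distributed implementation would compute these aggregates in a single pass over the data instead of materializing each column as a list.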


