Posted to issues@madlib.apache.org by "Rahul Iyer (JIRA)" <ji...@apache.org> on 2017/10/31 22:10:00 UTC

[jira] [Comment Edited] (MADLIB-1167) Summary - add more statistics

    [ https://issues.apache.org/jira/browse/MADLIB-1167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16227685#comment-16227685 ] 

Rahul Iyer edited comment on MADLIB-1167 at 10/31/17 10:09 PM:
---------------------------------------------------------------

For 4), we can report the confidence interval (CI) for the mean using the z-score. This should be fine in most cases, since this function will be run on big data (i.e., large samples). By the central limit theorem, we can then safely assume an approximately normal *sampling* distribution of the mean, irrespective of the *sample* distribution.
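
As a rough illustration of the z-score interval described above, here is a minimal Python sketch. It is not MADlib's implementation: the function name `mean_ci_z` and its parameters are hypothetical, and it assumes the variance passed in is the sample variance of the column.

```python
import math
from statistics import NormalDist


def mean_ci_z(mean, variance, n, confidence=0.95):
    """Return (lower, upper) bounds of a z-based CI for the mean.

    Hypothetical helper for illustration only. Assumes a large sample,
    so the sampling distribution of the mean is approximately normal.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    # Two-sided critical value of the standard normal (about 1.96 for 95%).
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)
    # Standard error of the mean: sqrt(variance / n).
    half_width = z * math.sqrt(variance / n)
    return mean - half_width, mean + half_width
```

For example, a mean of 10 with variance 4 over 100 rows gives a standard error of 0.2 and an interval of roughly 10 ± 0.392.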

The min and max values work on all numeric types and also on text types ('varchar', 'bpchar', 'text'). For text types, they report the minimum and maximum lengths of the text values.
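
The text-type behavior can be sketched as follows. This is an illustrative Python helper only (the name `min_max_text_length` is hypothetical, and skipping NULLs is an assumption), not the code used by the summary function.

```python
def min_max_text_length(values):
    """Return (min_length, max_length) over the non-NULL text values.

    Hypothetical helper for illustration; None stands in for SQL NULL
    and is skipped (an assumption, not confirmed behavior).
    """
    lengths = [len(v) for v in values if v is not None]
    if not lengths:
        return None, None
    return min(lengths), max(lengths)
```

Applied to the zipcode values in the quoted output ('94301y' and '84301x'), this returns a min and max of 6, which matches the reported `min` and `max`.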


was (Author: riyer):
For 4) we can report the CI for the mean using the z-score. This should be OK for most cases since this function will be run on big data (i.e. large sample). Hence the sample distribution should not be an issue. 

The min-max values work on all numeric types and also on text types ('varchar', 'bpchar', 'text'). For the text types it returns the min and max length of the text values. 

> Summary - add more statistics
> -----------------------------
>
>                 Key: MADLIB-1167
>                 URL: https://issues.apache.org/jira/browse/MADLIB-1167
>             Project: Apache MADlib
>          Issue Type: Improvement
>          Components: Module: Descriptive Statistics
>            Reporter: Frank McQuillan
>             Fix For: v2.0
>
>
> In the summary function
> http://madlib.apache.org/docs/latest/group__grp__summary.html
> add additional statistics:
> 1) % positive values
> 2) % negative values
> 3) % zero values
> 4) confidence intervals (95% ?) on mean
> * does this make sense, since need to assume a distribution for the data which we probably cannot infer?
> Also please check why min and max are being reported for non-numeric cols.  Is this a bug?
> {code}
> madlib=# SELECT * FROM houses_summary where target_column='zipcode';
> -[ RECORD 1 ]--------+----------------
> group_by             | 
> group_by_value       | 
> target_column        | zipcode
> column_number        | 8
> data_type            | text
> row_count            | 15
> distinct_values      | 2
> missing_values       | 0
> blank_values         | 0
> fraction_missing     | 0
> fraction_blank       | 0
> mean                 | 
> variance             | 
> min                  | 6
> max                  | 6
> first_quartile       | 
> median               | 
> third_quartile       | 
> most_frequent_values | {94301y,84301x}
> mfv_frequencies      | {10,5}
> {code}
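
For items 1) to 3) in the description above, one possible way to compute the percentages is sketched below in Python. The function name `sign_percentages` is hypothetical, and taking the non-NULL count as the denominator is an assumption; the issue does not specify it.

```python
def sign_percentages(values):
    """Return percent positive, negative, and zero among non-NULL values.

    Hypothetical helper for illustration. None stands in for SQL NULL and
    is excluded from the denominator (an assumption).
    """
    nums = [v for v in values if v is not None]
    n = len(nums)
    if n == 0:
        return None
    positive = sum(1 for v in nums if v > 0)
    negative = sum(1 for v in nums if v < 0)
    zero = sum(1 for v in nums if v == 0)
    return {
        "positive": 100.0 * positive / n,
        "negative": 100.0 * negative / n,
        "zero": 100.0 * zero / n,
    }
```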


