You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@madlib.apache.org by jingyimei <gi...@git.apache.org> on 2017/12/22 00:58:03 UTC

[GitHub] madlib pull request #220: Add more stats to summary function

GitHub user jingyimei opened a pull request:

    https://github.com/apache/madlib/pull/220

      Add more stats to summary function

    This PR added the following statistics to madlib.summary():
        positive values
        negative values
        zero values
        95% confidence intervals on mean
    User docs is updated due to adding new fields.
    
    Besides, we rename  'row_count' to 'num_col_summarized' in summary() return result to eliminate confusion from another `row_count` in user_defined_summary_table.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/jingyimei/madlib summary_more_stats

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/madlib/pull/220.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #220
    
----
commit cbd3cd4c0af5005ff4ea7d82cab74325221d4311
Author: Jingyi Mei <jm...@...>
Date:   2017-12-20T13:18:05Z

    Add more stats to summary function
    
    This commit added the following statistics to madlib.summary():
    positive values
    negative values
    zero values
    95% confidence intervals on mean
    
    User docs is updated due to adding new fields.

commit da7fea93b6fab72451643a0264b6e841448c9a4d
Author: Jingyi Mei <jm...@...>
Date:   2017-12-21T05:34:34Z

    Rename 'row_count' to 'num_col_summarized' in summary() return result
    
    Previously, when we run `SELECT * FROM madlib.summary(valid_inputs)`, it returns a
    composite type containing a filed named `row_count`, which refers to the
    number of rows in the output table.
    when we run `SELECT * FROM user_defined_summary_table;`, it also
    contains a column named `row_count`, which refers to number of rows for
    the target column.
    
    To eliminate the confusion, we rename the first `row_count` to
    `num_col_summarized`, modify explanation and also update user doc.

----


---

[GitHub] madlib pull request #220: Add more stats to summary function

Posted by jingyimei <gi...@git.apache.org>.
Github user jingyimei commented on a diff in the pull request:

    https://github.com/apache/madlib/pull/220#discussion_r158424408
  
    --- Diff: src/ports/postgres/modules/summary/Summarizer.py_in ---
    @@ -199,6 +200,22 @@ class Summarizer:
             args['max_columns'] = ','.join([minmax_type('max', c) for c in cols])
     
             args['ntile_columns'] = "array_to_string(array[NULL], ',')"
    +
    +        args['positive_columns'] = ','.join(["sum(case when {0} > 0 \
    +                                           then 1 else 0 end)".format(c['attname'])
    +                                          if c['typname'] in numeric_types
    +                                          else 'NULL' for c in cols])
    +
    +        args["negative_columns"] = ','.join(["sum(case when {0} < 0 \
    +                                           then 1 else 0 end)".format(c['attname'])
    +                                          if c['typname'] in numeric_types
    +                                          else 'NULL' for c in cols])
    +
    +        args["zero_columns"] = ','.join(["sum(case when {0} = 0 \
    --- End diff --
    
    If we are using float, maybe this will be more accurate.


---

[GitHub] madlib pull request #220: Add more stats to summary function

Posted by fmcquillan99 <gi...@git.apache.org>.
Github user fmcquillan99 commented on a diff in the pull request:

    https://github.com/apache/madlib/pull/220#discussion_r158412746
  
    --- Diff: src/ports/postgres/modules/summary/Summarizer.py_in ---
    @@ -199,6 +200,22 @@ class Summarizer:
             args['max_columns'] = ','.join([minmax_type('max', c) for c in cols])
     
             args['ntile_columns'] = "array_to_string(array[NULL], ',')"
    +
    +        args['positive_columns'] = ','.join(["sum(case when {0} > 0 \
    +                                           then 1 else 0 end)".format(c['attname'])
    +                                          if c['typname'] in numeric_types
    +                                          else 'NULL' for c in cols])
    +
    +        args["negative_columns"] = ','.join(["sum(case when {0} < 0 \
    +                                           then 1 else 0 end)".format(c['attname'])
    +                                          if c['typname'] in numeric_types
    +                                          else 'NULL' for c in cols])
    +
    +        args["zero_columns"] = ','.join(["sum(case when {0} = 0 \
    --- End diff --
    
    should we check equality to 0 exactly like this, or abs value < SMALL ?


---

[GitHub] madlib issue #220: Add more stats to summary function

Posted by iyerr3 <gi...@git.apache.org>.
Github user iyerr3 commented on the issue:

    https://github.com/apache/madlib/pull/220
  
    This can be closed since merged by d025bb4609baeb7c7a1d136590780a8fafdee208. 


---

[GitHub] madlib pull request #220: Add more stats to summary function

Posted by jingyimei <gi...@git.apache.org>.
Github user jingyimei closed the pull request at:

    https://github.com/apache/madlib/pull/220


---

[GitHub] madlib pull request #220: Add more stats to summary function

Posted by orhankislal <gi...@git.apache.org>.
Github user orhankislal commented on a diff in the pull request:

    https://github.com/apache/madlib/pull/220#discussion_r158568809
  
    --- Diff: src/ports/postgres/modules/summary/Summarizer.py_in ---
    @@ -199,6 +200,22 @@ class Summarizer:
             args['max_columns'] = ','.join([minmax_type('max', c) for c in cols])
     
             args['ntile_columns'] = "array_to_string(array[NULL], ',')"
    +
    +        args['positive_columns'] = ','.join(["sum(case when {0} > 0 \
    +                                           then 1 else 0 end)".format(c['attname'])
    +                                          if c['typname'] in numeric_types
    +                                          else 'NULL' for c in cols])
    +
    +        args["negative_columns"] = ','.join(["sum(case when {0} < 0 \
    +                                           then 1 else 0 end)".format(c['attname'])
    +                                          if c['typname'] in numeric_types
    +                                          else 'NULL' for c in cols])
    +
    +        args["zero_columns"] = ','.join(["sum(case when {0} = 0 \
    --- End diff --
    
    In graph algorithms such as SSSP and APSP, we used `EPSILON = 0.000001` for float comparisons. 


---

[GitHub] madlib pull request #220: Add more stats to summary function

Posted by fmcquillan99 <gi...@git.apache.org>.
Github user fmcquillan99 commented on a diff in the pull request:

    https://github.com/apache/madlib/pull/220#discussion_r158572631
  
    --- Diff: src/ports/postgres/modules/summary/Summarizer.py_in ---
    @@ -199,6 +200,22 @@ class Summarizer:
             args['max_columns'] = ','.join([minmax_type('max', c) for c in cols])
     
             args['ntile_columns'] = "array_to_string(array[NULL], ',')"
    +
    +        args['positive_columns'] = ','.join(["sum(case when {0} > 0 \
    +                                           then 1 else 0 end)".format(c['attname'])
    +                                          if c['typname'] in numeric_types
    +                                          else 'NULL' for c in cols])
    +
    +        args["negative_columns"] = ','.join(["sum(case when {0} < 0 \
    +                                           then 1 else 0 end)".format(c['attname'])
    +                                          if c['typname'] in numeric_types
    +                                          else 'NULL' for c in cols])
    +
    +        args["zero_columns"] = ','.join(["sum(case when {0} = 0 \
    --- End diff --
    
    Orhan when you say "comparison" between FLOATs, then yes, we do need to consider rounding.    
    
    But if we are checking *explicit* equality to 0 in summary() function, then we should not worry about rounding, it seems to me.  So I am changing my opinion here, and suggest that this PR that does checking = 0 is fine.  We can put it in the user docs that this is what we are doing.   
    
    Others please comment if you think this is an OK approach.


---

[GitHub] madlib issue #220: Add more stats to summary function

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit commented on the issue:

    https://github.com/apache/madlib/pull/220
  
    
    Refer to this link for build results (access rights to CI server needed): 
    https://builds.apache.org/job/madlib-pr-build/317/



---