You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@madlib.apache.org by jingyimei <gi...@git.apache.org> on 2017/12/22 00:58:03 UTC
[GitHub] madlib pull request #220: Add more stats to summary function
GitHub user jingyimei opened a pull request:
https://github.com/apache/madlib/pull/220
Add more stats to summary function
This PR added the following statistics to madlib.summary():
positive values
negative values
zero values
95% confidence intervals on mean
User docs is updated due to adding new fields.
Besides, we rename 'row_count' to 'num_col_summarized' in summary() return result to eliminate confusion from another `row_count` in user_defined_summary_table.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/jingyimei/madlib summary_more_stats
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/madlib/pull/220.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #220
----
commit cbd3cd4c0af5005ff4ea7d82cab74325221d4311
Author: Jingyi Mei <jm...@...>
Date: 2017-12-20T13:18:05Z
Add more stats to summary function
This commit added the following statistics to madlib.summary():
positive values
negative values
zero values
95% confidence intervals on mean
User docs is updated due to adding new fields.
commit da7fea93b6fab72451643a0264b6e841448c9a4d
Author: Jingyi Mei <jm...@...>
Date: 2017-12-21T05:34:34Z
Rename 'row_count' to 'num_col_summarized' in summary() return result
Previously, when we run `SELECT * FROM madlib.summary(valid_inputs)`, it returns a
composite type containing a filed named `row_count`, which refers to the
number of rows in the output table.
when we run `SELECT * FROM user_defined_summary_table;`, it also
contains a column named `row_count`, which refers to number of rows for
the target column.
To eliminate the confusion, we rename the first `row_count` to
`num_col_summarized`, modify explanation and also update user doc.
----
---
[GitHub] madlib pull request #220: Add more stats to summary function
Posted by jingyimei <gi...@git.apache.org>.
Github user jingyimei commented on a diff in the pull request:
https://github.com/apache/madlib/pull/220#discussion_r158424408
--- Diff: src/ports/postgres/modules/summary/Summarizer.py_in ---
@@ -199,6 +200,22 @@ class Summarizer:
args['max_columns'] = ','.join([minmax_type('max', c) for c in cols])
args['ntile_columns'] = "array_to_string(array[NULL], ',')"
+
+ args['positive_columns'] = ','.join(["sum(case when {0} > 0 \
+ then 1 else 0 end)".format(c['attname'])
+ if c['typname'] in numeric_types
+ else 'NULL' for c in cols])
+
+ args["negative_columns"] = ','.join(["sum(case when {0} < 0 \
+ then 1 else 0 end)".format(c['attname'])
+ if c['typname'] in numeric_types
+ else 'NULL' for c in cols])
+
+ args["zero_columns"] = ','.join(["sum(case when {0} = 0 \
--- End diff --
If we are using float, maybe this will be more accurate.
---
[GitHub] madlib pull request #220: Add more stats to summary function
Posted by fmcquillan99 <gi...@git.apache.org>.
Github user fmcquillan99 commented on a diff in the pull request:
https://github.com/apache/madlib/pull/220#discussion_r158412746
--- Diff: src/ports/postgres/modules/summary/Summarizer.py_in ---
@@ -199,6 +200,22 @@ class Summarizer:
args['max_columns'] = ','.join([minmax_type('max', c) for c in cols])
args['ntile_columns'] = "array_to_string(array[NULL], ',')"
+
+ args['positive_columns'] = ','.join(["sum(case when {0} > 0 \
+ then 1 else 0 end)".format(c['attname'])
+ if c['typname'] in numeric_types
+ else 'NULL' for c in cols])
+
+ args["negative_columns"] = ','.join(["sum(case when {0} < 0 \
+ then 1 else 0 end)".format(c['attname'])
+ if c['typname'] in numeric_types
+ else 'NULL' for c in cols])
+
+ args["zero_columns"] = ','.join(["sum(case when {0} = 0 \
--- End diff --
should we check equality to 0 exactly like this, or abs value < SMALL ?
---
[GitHub] madlib issue #220: Add more stats to summary function
Posted by iyerr3 <gi...@git.apache.org>.
Github user iyerr3 commented on the issue:
https://github.com/apache/madlib/pull/220
This can be closed since merged by d025bb4609baeb7c7a1d136590780a8fafdee208.
---
[GitHub] madlib pull request #220: Add more stats to summary function
Posted by jingyimei <gi...@git.apache.org>.
Github user jingyimei closed the pull request at:
https://github.com/apache/madlib/pull/220
---
[GitHub] madlib pull request #220: Add more stats to summary function
Posted by orhankislal <gi...@git.apache.org>.
Github user orhankislal commented on a diff in the pull request:
https://github.com/apache/madlib/pull/220#discussion_r158568809
--- Diff: src/ports/postgres/modules/summary/Summarizer.py_in ---
@@ -199,6 +200,22 @@ class Summarizer:
args['max_columns'] = ','.join([minmax_type('max', c) for c in cols])
args['ntile_columns'] = "array_to_string(array[NULL], ',')"
+
+ args['positive_columns'] = ','.join(["sum(case when {0} > 0 \
+ then 1 else 0 end)".format(c['attname'])
+ if c['typname'] in numeric_types
+ else 'NULL' for c in cols])
+
+ args["negative_columns"] = ','.join(["sum(case when {0} < 0 \
+ then 1 else 0 end)".format(c['attname'])
+ if c['typname'] in numeric_types
+ else 'NULL' for c in cols])
+
+ args["zero_columns"] = ','.join(["sum(case when {0} = 0 \
--- End diff --
In graph algorithms such as SSSP and APSP, we used `EPSILON = 0.000001` for float comparisons.
---
[GitHub] madlib pull request #220: Add more stats to summary function
Posted by fmcquillan99 <gi...@git.apache.org>.
Github user fmcquillan99 commented on a diff in the pull request:
https://github.com/apache/madlib/pull/220#discussion_r158572631
--- Diff: src/ports/postgres/modules/summary/Summarizer.py_in ---
@@ -199,6 +200,22 @@ class Summarizer:
args['max_columns'] = ','.join([minmax_type('max', c) for c in cols])
args['ntile_columns'] = "array_to_string(array[NULL], ',')"
+
+ args['positive_columns'] = ','.join(["sum(case when {0} > 0 \
+ then 1 else 0 end)".format(c['attname'])
+ if c['typname'] in numeric_types
+ else 'NULL' for c in cols])
+
+ args["negative_columns"] = ','.join(["sum(case when {0} < 0 \
+ then 1 else 0 end)".format(c['attname'])
+ if c['typname'] in numeric_types
+ else 'NULL' for c in cols])
+
+ args["zero_columns"] = ','.join(["sum(case when {0} = 0 \
--- End diff --
Orhan when you say "comparison" between FLOATs, then yes, we do need to consider rounding.
But if we are checking *explicit* equality to 0 in summary() function, then we should not worry about rounding, it seems to me. So I am changing my opinion here, and suggest that this PR that does checking = 0 is fine. We can put it in the user docs that this is what we are doing.
Others please comment if you think this is an OK approach.
---
[GitHub] madlib issue #220: Add more stats to summary function
Posted by asfgit <gi...@git.apache.org>.
Github user asfgit commented on the issue:
https://github.com/apache/madlib/pull/220
Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/madlib-pr-build/317/
---