Posted to dev@madlib.apache.org by orhankislal <gi...@git.apache.org> on 2018/05/08 00:01:29 UTC
[GitHub] madlib pull request #269: Statistics: Add grouping support for correlation f...
GitHub user orhankislal opened a pull request:
https://github.com/apache/madlib/pull/269
Statistics: Add grouping support for correlation functions
JIRA: MADLIB-1128
This commit adds grouping support to correlation and covariance
functions in MADlib stats. Changes include relevant queries to do the
same.
This commit also has refactor changes to a helper function in
utilities.py_in.
Co-authored-by: Jingyi Mei <jm...@pivotal.io>
Co-authored-by: Nikhil Kak <nk...@pivotal.io>
Co-authored-by: Nandish Jayaram <nj...@apache.org>
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/madlib/madlib feature/correlation-grouping
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/madlib/pull/269.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #269
----
commit e9dd9ae88d9d2acaea68c093396f6d600148ede4
Author: Orhan Kislal <ok...@...>
Date: 2018-05-07T23:53:51Z
Statistics: Add grouping support for correlation functions
JIRA: MADLIB-1128
This commit adds grouping support to correlation and covariance
functions in MADlib stats. Changes include relevant queries to do the
same.
This commit also has refactor changes to a helper function in
utilities.py_in.
Co-authored-by: Jingyi Mei <jm...@pivotal.io>
Co-authored-by: Nikhil Kak <nk...@pivotal.io>
Co-authored-by: Nandish Jayaram <nj...@apache.org>
----
---
[GitHub] madlib issue #269: Statistics: Add grouping support for correlation function...
Posted by fmcquillan99 <gi...@git.apache.org>.
Github user fmcquillan99 commented on the issue:
https://github.com/apache/madlib/pull/269
(1)
```
DROP TABLE IF EXISTS example_data_output, example_data_output_summary;
SELECT madlib.correlation( 'example_data',
'example_data_output',
'temperature, humidity'
);
```
produces
```
INFO: ['temperature', 'humidity']
CONTEXT: PL/Python function "correlation"
-[ RECORD 1 ]----------------------------------------------------------------
correlation | Summary for 'correlation' function
| Output table = example_data_output
| Producing correlation for columns: ['temperature', 'humidity']
| Total run time = ('example_data_output', 2, 0.2568531036376953)
```
which looks OK, but
```
DROP TABLE IF EXISTS example_data_output, example_data_output_summary;
SELECT madlib.covariance( 'example_data',
'example_data_output',
'temperature, humidity'
);
```
produces
```
INFO: ['temperature', 'humidity']
CONTEXT: PL/Python function "covariance"
-[ RECORD 1 ]----------------------------------------------------------------
covariance | Summary for 'correlation' function
| Output table = example_data_output
| Producing correlation for columns: ['temperature', 'humidity']
| Total run time = ('example_data_output', 2, 0.27153897285461426)
```
but it should say `Summary for 'covariance' function`
(2)
The above output should list the groups when grouping is used; currently it does not.
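For reference, a grouped call might look like the following (this assumes the new optional `grouping_cols` argument comes after the existing `verbose` parameter, and `region` is an illustrative grouping column not present in the examples above):
```
DROP TABLE IF EXISTS example_data_output, example_data_output_summary;
SELECT madlib.correlation( 'example_data',
                           'example_data_output',
                           'temperature, humidity',
                           FALSE,
                           'region'
);
```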
(3)
Is it intentional to print the INFO message every time the correlation/covariance functions are run?
---
[GitHub] madlib pull request #269: Statistics: Add grouping support for correlation f...
Posted by iyerr3 <gi...@git.apache.org>.
Github user iyerr3 commented on a diff in the pull request:
https://github.com/apache/madlib/pull/269#discussion_r186714879
--- Diff: src/ports/postgres/modules/utilities/utilities.py_in ---
@@ -800,26 +800,39 @@ def get_table_qualified_col_str(tbl_name, col_list):
for col in col_list])
+def validate_reserved_and_target_cols_intersection(source_table,
--- End diff --
A few suggestions here:
- If this is a validation function then it's better in `validate_args.py_in`
- This can be generalized to any reserved names, not just columns. That would require moving out `cols_in_tbl_valid`, which IMO is cleaner since the function does double duty right now.
- Could we please have a shorter name for this - something that does not include the "how" (i.e. intersection)? Maybe `does_exclude_reserved`? The argument names could also be just `targets` and `reserved` to not restrict them to column names.
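A standalone sketch of what the renamed helper could look like (the name `does_exclude_reserved` and the argument names `targets`/`reserved` are the proposals above; the error text is illustrative, and the real MADlib helper would still delegate table validation to `cols_in_tbl_valid` and use `_assert`):

```python
def does_exclude_reserved(targets, reserved):
    """Raise an error if any target name collides with a reserved name.

    Sketch only: works on any names, not just column names.
    """
    clash = frozenset(targets) & frozenset(reserved)
    if clash:
        raise ValueError(
            "Conflicting names: predefined keyword(s) ({0}) are not "
            "allowed in module input params.".format(', '.join(sorted(clash))))


# Passes silently when there is no overlap:
does_exclude_reserved(['temperature', 'humidity'], ['variable', 'column_position'])
```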
---
[GitHub] madlib issue #269: Statistics: Add grouping support for correlation function...
Posted by fmcquillan99 <gi...@git.apache.org>.
Github user fmcquillan99 commented on the issue:
https://github.com/apache/madlib/pull/269
Thanks for the explanation.
I pushed one additional small commit that changes the name of the module from "Pearson's Correlation" to "Covariance and Correlation", which is more descriptive of what this module does.
---
[GitHub] madlib issue #269: Statistics: Add grouping support for correlation function...
Posted by asfgit <gi...@git.apache.org>.
Github user asfgit commented on the issue:
https://github.com/apache/madlib/pull/269
Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/madlib-pr-build/474/
---
[GitHub] madlib issue #269: Statistics: Add grouping support for correlation function...
Posted by asfgit <gi...@git.apache.org>.
Github user asfgit commented on the issue:
https://github.com/apache/madlib/pull/269
Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/madlib-pr-build/479/
---
[GitHub] madlib pull request #269: Statistics: Add grouping support for correlation f...
Posted by iyerr3 <gi...@git.apache.org>.
Github user iyerr3 commented on a diff in the pull request:
https://github.com/apache/madlib/pull/269#discussion_r186729249
--- Diff: src/ports/postgres/modules/stats/correlation.py_in ---
@@ -166,101 +203,184 @@ def _populate_output_table(schema_madlib, source_table, output_table,
start = time()
col_len = len(col_names)
col_names_as_text_array = py_list_to_sql_string(col_names, "varchar")
- temp_table = unique_string()
+ # Create unique strings to be used in queries.
+ coalesced_col_array = unique_string(desp='coalesced_col_array')
+ mean_col = unique_string(desp='mean')
if get_cov:
- function_name = "Covariance"
agg_str = """
(CASE WHEN count(*) > 0
- THEN {0}.array_scalar_mult({0}.covariance_agg(x, mean),
+ THEN {0}.array_scalar_mult({0}.covariance_agg({1}, {2}),
1.0 / count(*)::double precision)
ELSE NULL
- END) """.format(schema_madlib)
+ END) """.format(schema_madlib, coalesced_col_array, mean_col)
else:
- function_name = "Correlation"
- agg_str = "{0}.correlation_agg(x, mean)".format(schema_madlib)
+ agg_str = "{0}.correlation_agg({1}, {2})".format(schema_madlib,
+ coalesced_col_array,
+ mean_col)
cols = ','.join(["coalesce({0}, {1})".format(col, add_postfix(col, "_avg"))
for col in col_names])
avgs = ','.join(["avg({0}) AS {1}".format(col, add_postfix(col, "_avg"))
for col in col_names])
avg_array = ','.join([str(add_postfix(col, "_avg")) for col in col_names])
- # actual computation
- sql1 = """
-
- CREATE TEMP TABLE {temp_table} AS
- SELECT
- count(*) AS tot_cnt,
- mean,
- {agg_str} as cor_mat
- FROM
- (
- SELECT ARRAY[ {cols} ] AS x,
- ARRAY [ {avg_array} ] AS mean
- FROM {source_table},
+ # Create unique strings to be used in queries.
+ tot_cnt = unique_string(desp='tot_cnt')
+ cor_mat = unique_string(desp='cor_mat')
+ temp_output_table = unique_string(desp='temp_output')
+ subquery1 = unique_string(desp='subq1')
+ subquery2 = unique_string(desp='subq2')
+
+ grouping_cols_comma = ''
+ subquery_grouping_cols_comma = ''
+ inner_group_by = ''
+ # Cross join if there are no groups to consider
+ join_condition = ' ON (1=1) '
+
+ if grouping_cols:
+ group_col_list = split_quoted_delimited_str(grouping_cols)
+ grouping_cols_comma = add_postfix(grouping_cols, ', ')
+ subquery_grouping_cols_comma = get_table_qualified_col_str(
+ subquery2, group_col_list) + " , "
+
+ inner_group_by = " GROUP BY {0}".format(grouping_cols)
+ join_condition = " USING ({0})".format(grouping_cols)
+
+ create_temp_output_table_query = """
+ CREATE TEMP TABLE {temp_output_table} AS
+ SELECT
+ {subquery_grouping_cols_comma}
+ count(*) AS {tot_cnt},
+ {mean_col},
+ {agg_str} AS {cor_mat}
+ FROM
(
- SELECT {avgs}
- FROM {source_table}
- )sub1
- ) sub2
- GROUP BY mean
- """.format(**locals())
+ SELECT {grouping_cols_comma}
+ ARRAY[ {cols} ] AS {coalesced_col_array},
+ ARRAY [ {avg_array} ] AS {mean_col}
- plpy.execute(sql1)
-
- # create summary table
+ FROM {source_table}
+ JOIN
+ (
+ SELECT {grouping_cols_comma} {avgs}
+ FROM {source_table}
+ {inner_group_by}
+ ) {subquery1}
+ {join_condition}
+ ) {subquery2}
+ GROUP BY {grouping_cols_comma} {mean_col}
+ """.format(**locals())
+ plpy.execute(create_temp_output_table_query)
+
+ # Prepare the query for converting the matrix into the lower triangle
+ deconstruction_query = _create_deconstruction_query(schema_madlib,
+ col_names,
+ grouping_cols,
+ temp_output_table,
+ cor_mat)
+
+ variable_subquery = unique_string(desp='variable_subq')
+ matrix_subquery = unique_string(desp='matrix_subq')
+ # create output table
+ plpy.info(col_names)
+ create_output_table_query = """
+ CREATE TABLE {output_table} AS
+ SELECT {grouping_cols_comma} column_position, variable, {target_cols}
+ FROM
+ (
+ SELECT
+ generate_series(1, {num_cols}) AS column_position,
+ unnest({col_names_as_text_array}) AS variable
+ ) {variable_subquery}
+ JOIN
+ (
+ {deconstruction_query}
+ ) {matrix_subquery}
+ USING (column_position)
+ """.format( num_cols=len(col_names),
+ target_cols=' , '.join(col_names),
+ **locals())
+ plpy.execute(create_output_table_query)
+
+ # create summary table
summary_table = add_postfix(output_table, "_summary")
- q_summary = """
+ create_summary_table_query = """
CREATE TABLE {summary_table} AS
SELECT
'{function_name}'::varchar AS method,
'{source_table}'::varchar AS source,
'{output_table}'::varchar AS output_table,
{col_names_as_text_array} AS column_names,
- mean AS mean_vector,
- tot_cnt AS total_rows_processed
- FROM {temp_table}
+ {grouping_cols_comma}
+ {mean_col} AS mean_vector,
+ {tot_cnt} AS total_rows_processed
+ FROM {temp_output_table}
""".format(**locals())
-
- plpy.execute(q_summary)
-
- # create output table
- variable_list = []
- for k, c in enumerate(col_names):
- if k % 10 == 0:
- variable_list.append("\n ")
- variable_list.append(str(c) + " float8")
- variable_list.append(",")
- variable_list_str = ''.join(variable_list[:-1]) # remove the last comma
-
- plpy.execute("""
- CREATE TABLE {output_table} AS
- SELECT
- *
- FROM
- (
- SELECT
- generate_series(1, {num_cols}) AS column_position,
- unnest({col_names_as_text_array}) AS variable
- ) variable_subq
- JOIN
- (
- SELECT
- *
- FROM
- {schema_madlib}.__deconstruct_lower_triangle(
- (SELECT cor_mat FROM {temp_table})
- )
- AS deconstructed(column_position integer, {variable_list_str})
- ) matrix_subq
- USING (column_position)
- """.format(num_cols=len(col_names), **locals()))
+ plpy.execute(create_summary_table_query)
# clean up and return
- plpy.execute("DROP TABLE {temp_table}".format(**locals()))
+ plpy.execute("DROP TABLE IF EXISTS {temp_output_table}".format(**locals()))
+
end = time()
return (output_table, len(col_names), end - start)
# ------------------------------------------------------------------------------
+def _create_deconstruction_query(schema_madlib, col_names, grouping_cols,
+ temp_output_table, cor_mat):
+ """
+ Creates the query to convert the matrix into the lower-triangular format.
+
+ Args:
+ @param schema_madlib Schema of MADlib
+ @param col_names Name of all columns to place in output table
+ @param grouping_cols Name of all columns to be used for grouping
+ @param temp_output_table Name of the temporary table that contains
+ the matrix to deconstruct
+ @param cor_mat Name of column that contains the matrix
+ to deconstruct
+
+ Returns:
+ String (SQL query for deconstructing the matrix)
+ """
+ # The matrix that holds the PCC computation must be converted to a
+ # table capturing all pairwise PCC values. That is done using
+ # a UDF named __deconstruct_lower_triangle.
+ # With grouping, calling that UDF becomes a lot more complex, so
+ # construct the query accordingly.
+
+ variable_list = []
+ for k, c in enumerate(col_names):
+ if k % 10 == 0:
+ variable_list.append("\n ")
+ variable_list.append(str(c) + " float8")
+ variable_list.append(",")
+ variable_list_str = ''.join(variable_list[:-1]) # remove the last comma
+
--- End diff --
I realize that the above 8 lines are from existing code, but I wonder if the following is easier to understand and read:
```
COL_WIDTH = 10
# Split col_names into equal-size chunks separated by newlines to keep each query line short.
# Build a 2D list of the col_names, each inner list holding COL_WIDTH names.
col_names_split = [col_names[x : x + COL_WIDTH] for x in range(0, len(col_names), COL_WIDTH)]
variable_list_str = ',\n '.join([', '.join(i) for i in col_names_split])
```
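One note on the suggestion: the original loop appends a ` float8` type to each name, which the snippet above drops. A self-contained sketch of the same chunking idiom that keeps the type suffix (COL_WIDTH and the sample names are illustrative):

```python
COL_WIDTH = 10
col_names = ['c{0}'.format(i) for i in range(23)]

# Chunk the column names into rows of COL_WIDTH, then join with newlines
# so the generated "deconstructed(...)" column list stays readable.
col_names_split = [col_names[x:x + COL_WIDTH]
                   for x in range(0, len(col_names), COL_WIDTH)]
variable_list_str = ',\n '.join(
    ', '.join(c + ' float8' for c in chunk) for chunk in col_names_split)
```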
---
[GitHub] madlib pull request #269: Statistics: Add grouping support for correlation f...
Posted by iyerr3 <gi...@git.apache.org>.
Github user iyerr3 commented on a diff in the pull request:
https://github.com/apache/madlib/pull/269#discussion_r186722850
--- Diff: src/ports/postgres/modules/stats/correlation.py_in ---
@@ -166,101 +203,184 @@ def _populate_output_table(schema_madlib, source_table, output_table,
start = time()
col_len = len(col_names)
col_names_as_text_array = py_list_to_sql_string(col_names, "varchar")
- temp_table = unique_string()
+ # Create unique strings to be used in queries.
+ coalesced_col_array = unique_string(desp='coalesced_col_array')
+ mean_col = unique_string(desp='mean')
if get_cov:
- function_name = "Covariance"
agg_str = """
(CASE WHEN count(*) > 0
- THEN {0}.array_scalar_mult({0}.covariance_agg(x, mean),
+ THEN {0}.array_scalar_mult({0}.covariance_agg({1}, {2}),
1.0 / count(*)::double precision)
ELSE NULL
- END) """.format(schema_madlib)
+ END) """.format(schema_madlib, coalesced_col_array, mean_col)
else:
- function_name = "Correlation"
- agg_str = "{0}.correlation_agg(x, mean)".format(schema_madlib)
+ agg_str = "{0}.correlation_agg({1}, {2})".format(schema_madlib,
+ coalesced_col_array,
+ mean_col)
cols = ','.join(["coalesce({0}, {1})".format(col, add_postfix(col, "_avg"))
for col in col_names])
avgs = ','.join(["avg({0}) AS {1}".format(col, add_postfix(col, "_avg"))
for col in col_names])
avg_array = ','.join([str(add_postfix(col, "_avg")) for col in col_names])
- # actual computation
- sql1 = """
-
- CREATE TEMP TABLE {temp_table} AS
- SELECT
- count(*) AS tot_cnt,
- mean,
- {agg_str} as cor_mat
- FROM
- (
- SELECT ARRAY[ {cols} ] AS x,
- ARRAY [ {avg_array} ] AS mean
- FROM {source_table},
+ # Create unique strings to be used in queries.
+ tot_cnt = unique_string(desp='tot_cnt')
+ cor_mat = unique_string(desp='cor_mat')
+ temp_output_table = unique_string(desp='temp_output')
+ subquery1 = unique_string(desp='subq1')
+ subquery2 = unique_string(desp='subq2')
+
+ grouping_cols_comma = ''
+ subquery_grouping_cols_comma = ''
+ inner_group_by = ''
+ # Cross join if there are no groups to consider
+ join_condition = ' ON (1=1) '
+
+ if grouping_cols:
+ group_col_list = split_quoted_delimited_str(grouping_cols)
+ grouping_cols_comma = add_postfix(grouping_cols, ', ')
+ subquery_grouping_cols_comma = get_table_qualified_col_str(
+ subquery2, group_col_list) + " , "
+
+ inner_group_by = " GROUP BY {0}".format(grouping_cols)
+ join_condition = " USING ({0})".format(grouping_cols)
+
+ create_temp_output_table_query = """
+ CREATE TEMP TABLE {temp_output_table} AS
+ SELECT
+ {subquery_grouping_cols_comma}
+ count(*) AS {tot_cnt},
+ {mean_col},
+ {agg_str} AS {cor_mat}
+ FROM
(
- SELECT {avgs}
- FROM {source_table}
- )sub1
- ) sub2
- GROUP BY mean
- """.format(**locals())
+ SELECT {grouping_cols_comma}
+ ARRAY[ {cols} ] AS {coalesced_col_array},
+ ARRAY [ {avg_array} ] AS {mean_col}
- plpy.execute(sql1)
-
- # create summary table
+ FROM {source_table}
+ JOIN
+ (
+ SELECT {grouping_cols_comma} {avgs}
+ FROM {source_table}
+ {inner_group_by}
+ ) {subquery1}
+ {join_condition}
+ ) {subquery2}
+ GROUP BY {grouping_cols_comma} {mean_col}
+ """.format(**locals())
+ plpy.execute(create_temp_output_table_query)
+
+ # Prepare the query for converting the matrix into the lower triangle
+ deconstruction_query = _create_deconstruction_query(schema_madlib,
+ col_names,
+ grouping_cols,
+ temp_output_table,
+ cor_mat)
+
+ variable_subquery = unique_string(desp='variable_subq')
+ matrix_subquery = unique_string(desp='matrix_subq')
+ # create output table
+ plpy.info(col_names)
--- End diff --
Info line can be deleted
---
[GitHub] madlib issue #269: Statistics: Add grouping support for correlation function...
Posted by orhankislal <gi...@git.apache.org>.
Github user orhankislal commented on the issue:
https://github.com/apache/madlib/pull/269
Thanks for your comments Frank. Regarding accuracy, I have tested the code with multiple groups from multiple grouping columns against hand-calculated values (as seen in the install-check). I have also tried a larger dataset (600 columns), duplicating the data to create multiple groups and checking that the results match across groups. Please let us know if you have any other suggestions.
---
[GitHub] madlib pull request #269: Statistics: Add grouping support for correlation f...
Posted by iyerr3 <gi...@git.apache.org>.
Github user iyerr3 commented on a diff in the pull request:
https://github.com/apache/madlib/pull/269#discussion_r186711399
--- Diff: src/ports/postgres/modules/stats/correlation.py_in ---
@@ -34,9 +40,18 @@ def correlation(schema_madlib, source_table, output_table,
Tuple (output table name, number of columns, time for computation)
"""
with MinWarning("info" if verbose else "error"):
- _validate_corr_arg(source_table, output_table)
+ if get_cov:
--- End diff --
One-liner might be cleaner: `function_name = 'Covariance' if get_cov else 'Correlation'`
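For illustration, the suggested conditional expression in isolation (the wrapping function is hypothetical):

```python
def pick_function_name(get_cov):
    # Conditional expression replacing the if/else block
    return 'Covariance' if get_cov else 'Correlation'
```

Deriving the name in one place like this would also make the summary-table mismatch reported earlier (covariance summary saying 'correlation') harder to reintroduce.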
---
[GitHub] madlib issue #269: Statistics: Add grouping support for correlation function...
Posted by asfgit <gi...@git.apache.org>.
Github user asfgit commented on the issue:
https://github.com/apache/madlib/pull/269
Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/madlib-pr-build/470/
---
[GitHub] madlib issue #269: Statistics: Add grouping support for correlation function...
Posted by fmcquillan99 <gi...@git.apache.org>.
Github user fmcquillan99 commented on the issue:
https://github.com/apache/madlib/pull/269
Also, what is the nature of the testing that ensures covariance and correlation are calculated properly with the added grouping support?
---
[GitHub] madlib issue #269: Statistics: Add grouping support for correlation function...
Posted by asfgit <gi...@git.apache.org>.
Github user asfgit commented on the issue:
https://github.com/apache/madlib/pull/269
Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/madlib-pr-build/478/
---
[GitHub] madlib issue #269: Statistics: Add grouping support for correlation function...
Posted by asfgit <gi...@git.apache.org>.
Github user asfgit commented on the issue:
https://github.com/apache/madlib/pull/269
Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/madlib-pr-build/473/
---
[GitHub] madlib issue #269: Statistics: Add grouping support for correlation function...
Posted by asfgit <gi...@git.apache.org>.
Github user asfgit commented on the issue:
https://github.com/apache/madlib/pull/269
Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/madlib-pr-build/476/
---
[GitHub] madlib pull request #269: Statistics: Add grouping support for correlation f...
Posted by iyerr3 <gi...@git.apache.org>.
Github user iyerr3 commented on a diff in the pull request:
https://github.com/apache/madlib/pull/269#discussion_r186715591
--- Diff: src/ports/postgres/modules/utilities/utilities.py_in ---
@@ -800,26 +800,39 @@ def get_table_qualified_col_str(tbl_name, col_list):
for col in col_list])
+def validate_reserved_and_target_cols_intersection(source_table,
+ target_cols_list,
+ reserved_cols_list,
+ module_name):
+ """
+ Function to check if any target column name is part of reserved column
+ names. Throw an error if it is.
+ """
+ cols_in_tbl_valid(source_table,
+ target_cols_list,
+ module_name)
+ intersect = frozenset(target_cols_list).intersection(
+ frozenset(reserved_cols_list))
+ _assert(len(intersect) == 0,
+ "{0} error: Conflicting column names.\n"
+ "Some predefined keyword(s) ({1}) are not allowed "
+ "for column names in module input params.".format(
+ module_name, ', '.join(intersect)))
+
def get_grouping_col_str(schema_madlib, module_name, reserved_cols,
source_table, grouping_col):
if grouping_col and grouping_col.lower() != 'null':
grouping_col_array = _string_to_array_with_quotes(grouping_col)
- cols_in_tbl_valid(source_table,
- grouping_col_array,
- module_name)
- intersect = frozenset(
- _string_to_array(grouping_col)).intersection(frozenset(reserved_cols))
- _assert(len(intersect) == 0,
- "{0} error: Conflicting grouping column name.\n"
- "Some predefined keyword(s) ({1}) are not allowed "
- "for grouping column names!".format(module_name, ', '.join(intersect)))
-
- grouping_list = [i + "::text"
- for i in explicit_bool_to_text(
- source_table,
- grouping_col_array,
- schema_madlib)]
- grouping_str = ','.join(grouping_list)
+ validate_reserved_and_target_cols_intersection(source_table,
+ grouping_col_array,
+ reserved_cols,
+ module_name)
+ grouping_list_with_text = [i + "::text" for i in
+ explicit_bool_to_text(
+ source_table,
+ grouping_col_array,
+ schema_madlib)]
+ grouping_str = ','.join(grouping_list_with_text)
--- End diff --
I suggest a minor edit for better readability:
```
grp_list_w_cast = explicit_bool_to_text(source_table, grp_array, schema_madlib)
grp_str = ', '.join(i + "::text" for i in grp_list_w_cast)
```
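To show the suggested two lines in isolation (here `explicit_bool_to_text` is stubbed as an identity function; the real MADlib helper rewrites boolean columns through a schema-qualified cast, so the stub is only enough to demonstrate the join idiom):

```python
def explicit_bool_to_text(source_table, cols, schema_madlib):
    # Stub: the real helper returns the columns with boolean columns
    # rewritten; identity suffices to show how grp_str is assembled.
    return cols

grp_list_w_cast = explicit_bool_to_text('my_table', ['region', 'is_urban'], 'madlib')
grp_str = ', '.join(i + "::text" for i in grp_list_w_cast)
# grp_str is now 'region::text, is_urban::text'
```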
---
[GitHub] madlib pull request #269: Statistics: Add grouping support for correlation f...
Posted by orhankislal <gi...@git.apache.org>.
Github user orhankislal closed the pull request at:
https://github.com/apache/madlib/pull/269
---