You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@madlib.apache.org by hpandeycodeit <gi...@git.apache.org> on 2018/07/03 22:22:07 UTC
[GitHub] madlib pull request #288: Madlib 1239
GitHub user hpandeycodeit opened a pull request:
https://github.com/apache/madlib/pull/288
Madlib 1239
JIRA: 1239
Added a new module cols_vec which Converts features from multiple columns of an input table into a feature array in a single column.
Following files are committed:
cols2vec.py_in
cols2vec.sql_in
test/cols2vec.sql_in
Modules.yml
For special characters handling, using the py_list_to_string with "long_format = False".
Also, split_quoted_delimited_str which quotes each element of the array.
Tests with special characters are added in the install check.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/hpandeycodeit/incubator-madlib MADLIB_1239
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/madlib/pull/288.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #288
----
commit 3b5a7780dee45edb215351879c892945d2c59c4c
Author: hpandeycodeit <hp...@...>
Date: 2018-06-15T08:33:27Z
Changes for Jira: 1239, Converts features from multiple columns into a feature array
commit 06bef61bffe71bae5bb9956c650d39ae7e0d5ef8
Author: hpandeycodeit <hp...@...>
Date: 2018-06-15T08:41:10Z
added Module cols_vec
commit 40f08e1f17e53fc8b3fddb873f31f35be0aeb7f6
Author: hpandeycodeit <hp...@...>
Date: 2018-06-15T08:43:23Z
added Module cols_vec
commit 3050607311059acde59d8a309eb10bb745a31469
Author: hpandeycodeit <hp...@...>
Date: 2018-06-18T22:13:31Z
Fixed the test file for install-check failure.
commit 7dce06b3e7380ac8a97021943246450568b6e242
Author: hpandeycodeit <hp...@...>
Date: 2018-07-03T22:16:36Z
Changes for Jira: 1239, Converts features from multiple columns into a feature array
----
---
[GitHub] madlib issue #288: Jira:1239: Converts features from multiple columns into a...
Posted by asfgit <gi...@git.apache.org>.
Github user asfgit commented on the issue:
https://github.com/apache/madlib/pull/288
Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/madlib-pr-build/536/
---
[GitHub] madlib pull request #288: Jira:1239: Converts features from multiple columns...
Posted by hpandeycodeit <gi...@git.apache.org>.
Github user hpandeycodeit commented on a diff in the pull request:
https://github.com/apache/madlib/pull/288#discussion_r200891497
--- Diff: src/ports/postgres/modules/cols_vec/cols2vec.py_in ---
@@ -0,0 +1,104 @@
+"""
+@file cols2vec.py_in
+
+@brief Utility to convert Columns to array
+
+"""
+
+import plpy
+from utilities.control import MinWarning
+from utilities.utilities import split_quoted_delimited_str
+from utilities.utilities import _string_to_array
+from utilities.utilities import _assert
+from utilities.validate_args import columns_exist_in_table
+from utilities.validate_args import is_var_valid
+from utilities.validate_args import get_cols
+from utilities.validate_args import quote_ident
+from utilities.utilities import py_list_to_sql_string
+
+
+m4_changequote(`<!', `!>')
+
+
+def validate_cols2vec_args(source_table, output_table,
+ list_of_features, list_of_features_to_exclude, cols_to_output, **kwargs):
+ """
+ Function to validate input parameters
+ """
+ if list_of_features.strip() != '*':
+ if not (list_of_features and list_of_features.strip()):
+ plpy.error("Features to include is empty")
+ _assert(
+ columns_exist_in_table(
+ source_table, split_quoted_delimited_str(list_of_features)),
+ "Invalid columns to list of features {0}".format(list_of_features))
+
+ if cols_to_output and cols_to_output.strip() != '*':
+ _assert(
+ columns_exist_in_table(
+ source_table, _string_to_array(cols_to_output)),
+ "Invalid columns to output list {0}".format(cols_to_output))
+
+
+def cols2vec(schema_madlib, source_table, output_table, list_of_features,
+ list_of_features_to_exclude=None, cols_to_output=None, **kwargs):
+ """
+ Args:
+ @param schema_madlib: Name of MADlib schema
+ @param model: Name of table containing the tree model
+ @param source_table: Name of table containing prediction data
+ @param output_table: Name of table to output the results
+ @param list_of_features: Comma-separated string of column names or
+ expressions to put into feature array.
+ Can also be a '*' implying all columns
+ are to be put into feature array.
+ @param list_of_features_to_exclude: Comma-separated string of column names
+ to exclude from the feature array
+ @param cols_to_output: Comma-separated string of column names
+ from the source table to keep in the output table,
+ in addition to the feature array.
+
+ Returns:
+ None
+
+ """
+
+ with MinWarning('warning'):
+ validate_cols2vec_args(source_table, output_table, list_of_features,
+ list_of_features_to_exclude, cols_to_output, **kwargs)
+
+ all_cols = ''
+ feature_cols = ''
+ feature_list = ''
+ if list_of_features.strip() == '*':
+ all_cols = get_cols(source_table, schema_madlib)
+ all_col_set = set(list(all_cols))
+ exclude_set = set(split_quoted_delimited_str(
+ list_of_features_to_exclude))
+ feature_list = list(all_col_set - exclude_set)
+ else:
+ feature_list = split_quoted_delimited_str(list_of_features)
+
+ feature_cols = py_list_to_sql_string(
+ list(feature_list), "text", False)
+ filtered_list_of_features = ",".join(
--- End diff --
Above changes are done as suggested.
---
[GitHub] madlib issue #288: Jira:1239: Converts features from multiple columns into a...
Posted by asfgit <gi...@git.apache.org>.
Github user asfgit commented on the issue:
https://github.com/apache/madlib/pull/288
Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/madlib-pr-build/535/
---
[GitHub] madlib pull request #288: Jira:1239: Converts features from multiple columns...
Posted by iyerr3 <gi...@git.apache.org>.
Github user iyerr3 commented on a diff in the pull request:
https://github.com/apache/madlib/pull/288#discussion_r199982717
--- Diff: src/ports/postgres/modules/cols_vec/cols2vec.py_in ---
@@ -0,0 +1,110 @@
+"""
+@file cols2vec.py_in
+
+@brief Utility to convert Columns to array
+
+"""
+
+import plpy
+from utilities.control import MinWarning
+from utilities.utilities import split_quoted_delimited_str
+from utilities.utilities import _string_to_array
+from utilities.validate_args import columns_exist_in_table
+from utilities.validate_args import is_var_valid
+from utilities.validate_args import get_cols
+from utilities.validate_args import quote_ident
+from utilities.utilities import py_list_to_sql_string
+
+
+m4_changequote(`<!', `!>')
+
+
+def validate_cols2vec_args(source_table, output_table,
+ list_of_features, list_of_features_to_exclude, cols_to_output, **kwargs):
+ """
+ Function to validate input parameters
+ """
+ if list_of_features.strip() != '*':
+ if not (list_of_features and list_of_features.strip()):
+ plpy.error("Features to include is empty")
+
+ if list_of_features.strip() != '*':
+ if not columns_exist_in_table(
+ source_table, split_quoted_delimited_str(list_of_features)):
+ plpy.error(
+ "Invalid columns to list_of_features ({0})".format(list_of_features))
+
+ if cols_to_output and cols_to_output.strip() != '*':
+ if not columns_exist_in_table(
+ source_table, _string_to_array(cols_to_output)):
+ plpy.error("Invalid columns to output list ({0})".
+ format(cols_to_output))
+
+
+def cols2vec(schema_madlib, source_table, output_table, list_of_features,
+ list_of_features_to_exclude=None, cols_to_output=None, **kwargs):
+ """
+ Args:
+ @param schema_madlib: Name of MADlib schema
+ @param model: Name of table containing the tree model
+ @param source_table: Name of table containing prediction data
+ @param output_table: Name of table to output the results
+ @param list_of_features: Comma-separated string of column names or
+ expressions to put into feature array.
+ Can also be a '*' implying all columns
+ are to be put into feature array.
+ @param list_of_features_to_exclude: Comma-separated string of column names
+ to exclude from the feature array
+ @param cols_to_output: Comma-separated string of column names
+ from the source table to keep in the output table,
+ in addition to the feature array.
+
+ Returns:
+ None
+
+ """
+
+ with MinWarning('warning'):
+ validate_cols2vec_args(source_table, output_table, list_of_features,
+ list_of_features_to_exclude, cols_to_output, **kwargs)
+
+ all_cols = ''
+ feature_cols = ''
+ if list_of_features.strip() == '*':
--- End diff --
The `if` and `else` blocks seem to be very similar except for the source of the `all_cols/feature_list`. I suggest using the if switch only for populating the source columns. Other statements can be moved out of the if.
---
[GitHub] madlib pull request #288: Jira:1239: Converts features from multiple columns...
Posted by iyerr3 <gi...@git.apache.org>.
Github user iyerr3 commented on a diff in the pull request:
https://github.com/apache/madlib/pull/288#discussion_r200790240
--- Diff: src/ports/postgres/modules/cols_vec/cols2vec.py_in ---
@@ -0,0 +1,104 @@
+"""
+@file cols2vec.py_in
+
+@brief Utility to convert Columns to array
+
+"""
+
+import plpy
+from utilities.control import MinWarning
+from utilities.utilities import split_quoted_delimited_str
+from utilities.utilities import _string_to_array
+from utilities.utilities import _assert
+from utilities.validate_args import columns_exist_in_table
+from utilities.validate_args import is_var_valid
+from utilities.validate_args import get_cols
+from utilities.validate_args import quote_ident
+from utilities.utilities import py_list_to_sql_string
+
+
+m4_changequote(`<!', `!>')
+
+
+def validate_cols2vec_args(source_table, output_table,
+ list_of_features, list_of_features_to_exclude, cols_to_output, **kwargs):
+ """
+ Function to validate input parameters
+ """
+ if list_of_features.strip() != '*':
+ if not (list_of_features and list_of_features.strip()):
+ plpy.error("Features to include is empty")
+ _assert(
+ columns_exist_in_table(
+ source_table, split_quoted_delimited_str(list_of_features)),
+ "Invalid columns to list of features {0}".format(list_of_features))
+
+ if cols_to_output and cols_to_output.strip() != '*':
+ _assert(
+ columns_exist_in_table(
+ source_table, _string_to_array(cols_to_output)),
+ "Invalid columns to output list {0}".format(cols_to_output))
+
+
+def cols2vec(schema_madlib, source_table, output_table, list_of_features,
+ list_of_features_to_exclude=None, cols_to_output=None, **kwargs):
+ """
+ Args:
+ @param schema_madlib: Name of MADlib schema
+ @param model: Name of table containing the tree model
+ @param source_table: Name of table containing prediction data
+ @param output_table: Name of table to output the results
+ @param list_of_features: Comma-separated string of column names or
+ expressions to put into feature array.
+ Can also be a '*' implying all columns
+ are to be put into feature array.
+ @param list_of_features_to_exclude: Comma-separated string of column names
+ to exclude from the feature array
+ @param cols_to_output: Comma-separated string of column names
+ from the source table to keep in the output table,
+ in addition to the feature array.
+
+ Returns:
+ None
+
+ """
+
+ with MinWarning('warning'):
+ validate_cols2vec_args(source_table, output_table, list_of_features,
+ list_of_features_to_exclude, cols_to_output, **kwargs)
+
+ all_cols = ''
--- End diff --
We can delete lines 70, 71, 72 since we don't need those anymore.
---
[GitHub] madlib issue #288: Jira:1239: Converts features from multiple columns into a...
Posted by asfgit <gi...@git.apache.org>.
Github user asfgit commented on the issue:
https://github.com/apache/madlib/pull/288
Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/madlib-pr-build/537/
---
[GitHub] madlib pull request #288: Jira:1239: Converts features from multiple columns...
Posted by iyerr3 <gi...@git.apache.org>.
Github user iyerr3 commented on a diff in the pull request:
https://github.com/apache/madlib/pull/288#discussion_r199981970
--- Diff: src/ports/postgres/modules/cols_vec/cols2vec.py_in ---
@@ -0,0 +1,110 @@
+"""
+@file cols2vec.py_in
+
+@brief Utility to convert Columns to array
+
+"""
+
+import plpy
+from utilities.control import MinWarning
+from utilities.utilities import split_quoted_delimited_str
+from utilities.utilities import _string_to_array
+from utilities.validate_args import columns_exist_in_table
+from utilities.validate_args import is_var_valid
+from utilities.validate_args import get_cols
+from utilities.validate_args import quote_ident
+from utilities.utilities import py_list_to_sql_string
+
+
+m4_changequote(`<!', `!>')
+
+
+def validate_cols2vec_args(source_table, output_table,
+ list_of_features, list_of_features_to_exclude, cols_to_output, **kwargs):
+ """
+ Function to validate input parameters
+ """
+ if list_of_features.strip() != '*':
--- End diff --
The checks below are better expressed as assert statements using `_assert(...)`
---
[GitHub] madlib pull request #288: Jira:1239: Converts features from multiple columns...
Posted by hpandeycodeit <gi...@git.apache.org>.
Github user hpandeycodeit commented on a diff in the pull request:
https://github.com/apache/madlib/pull/288#discussion_r200787876
--- Diff: src/config/Modules.yml ---
@@ -50,3 +50,4 @@ modules:
- name: validation
depends: ['array_ops', 'regress']
- name: stemmer
+ - name: cols_vec
--- End diff --
Discussed it with @fmcquillan99 and I will move it under Utilities module in next commit.
---
[GitHub] madlib pull request #288: Jira:1239: Converts features from multiple columns...
Posted by iyerr3 <gi...@git.apache.org>.
Github user iyerr3 commented on a diff in the pull request:
https://github.com/apache/madlib/pull/288#discussion_r199983644
--- Diff: src/ports/postgres/modules/cols_vec/cols2vec.py_in ---
@@ -0,0 +1,110 @@
+"""
+@file cols2vec.py_in
+
+@brief Utility to convert Columns to array
+
+"""
+
+import plpy
+from utilities.control import MinWarning
+from utilities.utilities import split_quoted_delimited_str
+from utilities.utilities import _string_to_array
+from utilities.validate_args import columns_exist_in_table
+from utilities.validate_args import is_var_valid
+from utilities.validate_args import get_cols
+from utilities.validate_args import quote_ident
+from utilities.utilities import py_list_to_sql_string
+
+
+m4_changequote(`<!', `!>')
+
+
+def validate_cols2vec_args(source_table, output_table,
+ list_of_features, list_of_features_to_exclude, cols_to_output, **kwargs):
+ """
+ Function to validate input parameters
+ """
+ if list_of_features.strip() != '*':
+ if not (list_of_features and list_of_features.strip()):
+ plpy.error("Features to include is empty")
+
+ if list_of_features.strip() != '*':
+ if not columns_exist_in_table(
+ source_table, split_quoted_delimited_str(list_of_features)):
+ plpy.error(
+ "Invalid columns to list_of_features ({0})".format(list_of_features))
+
+ if cols_to_output and cols_to_output.strip() != '*':
+ if not columns_exist_in_table(
+ source_table, _string_to_array(cols_to_output)):
+ plpy.error("Invalid columns to output list ({0})".
+ format(cols_to_output))
+
+
+def cols2vec(schema_madlib, source_table, output_table, list_of_features,
+ list_of_features_to_exclude=None, cols_to_output=None, **kwargs):
+ """
+ Args:
+ @param schema_madlib: Name of MADlib schema
+ @param model: Name of table containing the tree model
+ @param source_table: Name of table containing prediction data
+ @param output_table: Name of table to output the results
+ @param list_of_features: Comma-separated string of column names or
+ expressions to put into feature array.
+ Can also be a '*' implying all columns
+ are to be put into feature array.
+ @param list_of_features_to_exclude: Comma-separated string of column names
+ to exclude from the feature array
+ @param cols_to_output: Comma-separated string of column names
+ from the source table to keep in the output table,
+ in addition to the feature array.
+
+ Returns:
+ None
+
+ """
+
+ with MinWarning('warning'):
+ validate_cols2vec_args(source_table, output_table, list_of_features,
+ list_of_features_to_exclude, cols_to_output, **kwargs)
+
+ all_cols = ''
+ feature_cols = ''
+ if list_of_features.strip() == '*':
+ all_cols = get_cols(source_table, schema_madlib)
+ all_col_set = set(list(all_cols))
+ exclude_set = set(split_quoted_delimited_str(
+ list_of_features_to_exclude))
+ feature_set = all_col_set - exclude_set
+ feature_cols = py_list_to_sql_string(
+ list(feature_set), "text", False)
+ filtered_list_of_features = ",".join(
+ feat for feat in list(feature_set))
+ else:
+ feature_list = split_quoted_delimited_str(list_of_features)
+ feature_exclude = split_quoted_delimited_str(
+ list_of_features_to_exclude)
+ return_set = set(feature_list) - set(feature_exclude)
+ feature_cols = py_list_to_sql_string(
+ list(return_set), "text", False)
+ filtered_list_of_features = ",".join(
+ feat for feat in feature_list if feat in return_set)
+
+ output_cols = ''
+ if cols_to_output:
+ output_cols_list = [', '.join(get_cols(source_table, schema_madlib)) if col == '*' else col
+ for col in split_quoted_delimited_str(cols_to_output)]
+ output_cols = ', '.join(output_cols_list) + ","
+
+ plpy.execute("""
+ CREATE TABLE {output_table} AS
+ select {output_cols}
+ array[{filtered_list_of_features}] as feature_vector
+ from {source_table}
+ """.format(**locals()))
+
+ plpy.execute("""
--- End diff --
I haven't understood the need for the summary table. Is it just to record the features combined in the array? Can we provide that as an output of the function? Creating a table just to record that one parameter seems unnecessary.
---
[GitHub] madlib pull request #288: Jira:1239: Converts features from multiple columns...
Posted by iyerr3 <gi...@git.apache.org>.
Github user iyerr3 commented on a diff in the pull request:
https://github.com/apache/madlib/pull/288#discussion_r200790286
--- Diff: src/ports/postgres/modules/cols_vec/cols2vec.py_in ---
@@ -0,0 +1,104 @@
+"""
+@file cols2vec.py_in
+
+@brief Utility to convert Columns to array
+
+"""
+
+import plpy
+from utilities.control import MinWarning
+from utilities.utilities import split_quoted_delimited_str
+from utilities.utilities import _string_to_array
+from utilities.utilities import _assert
+from utilities.validate_args import columns_exist_in_table
+from utilities.validate_args import is_var_valid
+from utilities.validate_args import get_cols
+from utilities.validate_args import quote_ident
+from utilities.utilities import py_list_to_sql_string
+
+
+m4_changequote(`<!', `!>')
+
+
+def validate_cols2vec_args(source_table, output_table,
+ list_of_features, list_of_features_to_exclude, cols_to_output, **kwargs):
+ """
+ Function to validate input parameters
+ """
+ if list_of_features.strip() != '*':
+ if not (list_of_features and list_of_features.strip()):
+ plpy.error("Features to include is empty")
+ _assert(
+ columns_exist_in_table(
+ source_table, split_quoted_delimited_str(list_of_features)),
+ "Invalid columns to list of features {0}".format(list_of_features))
+
+ if cols_to_output and cols_to_output.strip() != '*':
+ _assert(
+ columns_exist_in_table(
+ source_table, _string_to_array(cols_to_output)),
+ "Invalid columns to output list {0}".format(cols_to_output))
+
+
+def cols2vec(schema_madlib, source_table, output_table, list_of_features,
+ list_of_features_to_exclude=None, cols_to_output=None, **kwargs):
+ """
+ Args:
+ @param schema_madlib: Name of MADlib schema
+ @param model: Name of table containing the tree model
+ @param source_table: Name of table containing prediction data
+ @param output_table: Name of table to output the results
+ @param list_of_features: Comma-separated string of column names or
+ expressions to put into feature array.
+ Can also be a '*' implying all columns
+ are to be put into feature array.
+ @param list_of_features_to_exclude: Comma-separated string of column names
+ to exclude from the feature array
+ @param cols_to_output: Comma-separated string of column names
+ from the source table to keep in the output table,
+ in addition to the feature array.
+
+ Returns:
+ None
+
+ """
+
+ with MinWarning('warning'):
+ validate_cols2vec_args(source_table, output_table, list_of_features,
+ list_of_features_to_exclude, cols_to_output, **kwargs)
+
+ all_cols = ''
+ feature_cols = ''
+ feature_list = ''
+ if list_of_features.strip() == '*':
+ all_cols = get_cols(source_table, schema_madlib)
+ all_col_set = set(list(all_cols))
+ exclude_set = set(split_quoted_delimited_str(
+ list_of_features_to_exclude))
+ feature_list = list(all_col_set - exclude_set)
+ else:
+ feature_list = split_quoted_delimited_str(list_of_features)
+
+ feature_cols = py_list_to_sql_string(
+ list(feature_list), "text", False)
+ filtered_list_of_features = ",".join(
--- End diff --
`filtered_list_of_features = ",".join(feature_list)`
---
[GitHub] madlib pull request #288: Jira:1239: Converts features from multiple columns...
Posted by iyerr3 <gi...@git.apache.org>.
Github user iyerr3 commented on a diff in the pull request:
https://github.com/apache/madlib/pull/288#discussion_r199982049
--- Diff: src/ports/postgres/modules/cols_vec/cols2vec.py_in ---
@@ -0,0 +1,110 @@
+"""
+@file cols2vec.py_in
+
+@brief Utility to convert Columns to array
+
+"""
+
+import plpy
+from utilities.control import MinWarning
+from utilities.utilities import split_quoted_delimited_str
+from utilities.utilities import _string_to_array
+from utilities.validate_args import columns_exist_in_table
+from utilities.validate_args import is_var_valid
+from utilities.validate_args import get_cols
+from utilities.validate_args import quote_ident
+from utilities.utilities import py_list_to_sql_string
+
+
+m4_changequote(`<!', `!>')
+
+
+def validate_cols2vec_args(source_table, output_table,
+ list_of_features, list_of_features_to_exclude, cols_to_output, **kwargs):
+ """
+ Function to validate input parameters
+ """
+ if list_of_features.strip() != '*':
+ if not (list_of_features and list_of_features.strip()):
+ plpy.error("Features to include is empty")
+
+ if list_of_features.strip() != '*':
--- End diff --
Please combine this with the above if statement
---
[GitHub] madlib pull request #288: Jira:1239: Converts features from multiple columns...
Posted by iyerr3 <gi...@git.apache.org>.
Github user iyerr3 commented on a diff in the pull request:
https://github.com/apache/madlib/pull/288#discussion_r199983404
--- Diff: src/ports/postgres/modules/cols_vec/cols2vec.py_in ---
@@ -0,0 +1,110 @@
+"""
+@file cols2vec.py_in
+
+@brief Utility to convert Columns to array
+
+"""
+
+import plpy
+from utilities.control import MinWarning
+from utilities.utilities import split_quoted_delimited_str
+from utilities.utilities import _string_to_array
+from utilities.validate_args import columns_exist_in_table
+from utilities.validate_args import is_var_valid
+from utilities.validate_args import get_cols
+from utilities.validate_args import quote_ident
+from utilities.utilities import py_list_to_sql_string
+
+
+m4_changequote(`<!', `!>')
+
+
+def validate_cols2vec_args(source_table, output_table,
+ list_of_features, list_of_features_to_exclude, cols_to_output, **kwargs):
+ """
+ Function to validate input parameters
+ """
+ if list_of_features.strip() != '*':
+ if not (list_of_features and list_of_features.strip()):
+ plpy.error("Features to include is empty")
+
+ if list_of_features.strip() != '*':
+ if not columns_exist_in_table(
+ source_table, split_quoted_delimited_str(list_of_features)):
+ plpy.error(
+ "Invalid columns to list_of_features ({0})".format(list_of_features))
+
+ if cols_to_output and cols_to_output.strip() != '*':
+ if not columns_exist_in_table(
+ source_table, _string_to_array(cols_to_output)):
+ plpy.error("Invalid columns to output list ({0})".
+ format(cols_to_output))
+
+
+def cols2vec(schema_madlib, source_table, output_table, list_of_features,
+ list_of_features_to_exclude=None, cols_to_output=None, **kwargs):
+ """
+ Args:
+ @param schema_madlib: Name of MADlib schema
+ @param model: Name of table containing the tree model
+ @param source_table: Name of table containing prediction data
+ @param output_table: Name of table to output the results
+ @param list_of_features: Comma-separated string of column names or
+ expressions to put into feature array.
+ Can also be a '*' implying all columns
+ are to be put into feature array.
+ @param list_of_features_to_exclude: Comma-separated string of column names
+ to exclude from the feature array
+ @param cols_to_output: Comma-separated string of column names
+ from the source table to keep in the output table,
+ in addition to the feature array.
+
+ Returns:
+ None
+
+ """
+
+ with MinWarning('warning'):
+ validate_cols2vec_args(source_table, output_table, list_of_features,
+ list_of_features_to_exclude, cols_to_output, **kwargs)
+
+ all_cols = ''
+ feature_cols = ''
+ if list_of_features.strip() == '*':
+ all_cols = get_cols(source_table, schema_madlib)
+ all_col_set = set(list(all_cols))
+ exclude_set = set(split_quoted_delimited_str(
+ list_of_features_to_exclude))
+ feature_set = all_col_set - exclude_set
+ feature_cols = py_list_to_sql_string(
+ list(feature_set), "text", False)
+ filtered_list_of_features = ",".join(
+ feat for feat in list(feature_set))
+ else:
+ feature_list = split_quoted_delimited_str(list_of_features)
+ feature_exclude = split_quoted_delimited_str(
+ list_of_features_to_exclude)
+ return_set = set(feature_list) - set(feature_exclude)
--- End diff --
The order of the features are lost in this operation.
---
[GitHub] madlib pull request #288: Jira:1239: Converts features from multiple columns...
Posted by iyerr3 <gi...@git.apache.org>.
Github user iyerr3 commented on a diff in the pull request:
https://github.com/apache/madlib/pull/288#discussion_r200790203
--- Diff: src/ports/postgres/modules/cols_vec/cols2vec.py_in ---
@@ -0,0 +1,104 @@
+"""
+@file cols2vec.py_in
+
+@brief Utility to convert Columns to array
+
+"""
+
+import plpy
+from utilities.control import MinWarning
+from utilities.utilities import split_quoted_delimited_str
+from utilities.utilities import _string_to_array
+from utilities.utilities import _assert
+from utilities.validate_args import columns_exist_in_table
+from utilities.validate_args import is_var_valid
+from utilities.validate_args import get_cols
+from utilities.validate_args import quote_ident
+from utilities.utilities import py_list_to_sql_string
+
+
+m4_changequote(`<!', `!>')
+
+
+def validate_cols2vec_args(source_table, output_table,
+ list_of_features, list_of_features_to_exclude, cols_to_output, **kwargs):
+ """
+ Function to validate input parameters
+ """
+ if list_of_features.strip() != '*':
+ if not (list_of_features and list_of_features.strip()):
+ plpy.error("Features to include is empty")
+ _assert(
+ columns_exist_in_table(
+ source_table, split_quoted_delimited_str(list_of_features)),
+ "Invalid columns to list of features {0}".format(list_of_features))
+
+ if cols_to_output and cols_to_output.strip() != '*':
+ _assert(
+ columns_exist_in_table(
+ source_table, _string_to_array(cols_to_output)),
+ "Invalid columns to output list {0}".format(cols_to_output))
+
+
+def cols2vec(schema_madlib, source_table, output_table, list_of_features,
+ list_of_features_to_exclude=None, cols_to_output=None, **kwargs):
+ """
+ Args:
+ @param schema_madlib: Name of MADlib schema
+ @param model: Name of table containing the tree model
+ @param source_table: Name of table containing prediction data
+ @param output_table: Name of table to output the results
+ @param list_of_features: Comma-separated string of column names or
+ expressions to put into feature array.
+ Can also be a '*' implying all columns
+ are to be put into feature array.
+ @param list_of_features_to_exclude: Comma-separated string of column names
+ to exclude from the feature array
+ @param cols_to_output: Comma-separated string of column names
+ from the source table to keep in the output table,
+ in addition to the feature array.
+
+ Returns:
+ None
+
+ """
+
+ with MinWarning('warning'):
+ validate_cols2vec_args(source_table, output_table, list_of_features,
+ list_of_features_to_exclude, cols_to_output, **kwargs)
+
+ all_cols = ''
+ feature_cols = ''
+ feature_list = ''
+ if list_of_features.strip() == '*':
+ all_cols = get_cols(source_table, schema_madlib)
+ all_col_set = set(list(all_cols))
--- End diff --
The order of the columns (retained by `get_cols`) is lost here. I suggest:
```
feature_list = [col for col in all_cols if col not in exclude_set]
```
---
[GitHub] madlib issue #288: Jira:1239: Converts features from multiple columns into a...
Posted by asfgit <gi...@git.apache.org>.
Github user asfgit commented on the issue:
https://github.com/apache/madlib/pull/288
Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/madlib-pr-build/539/
---
[GitHub] madlib issue #288: Jira:1239: Converts features from multiple columns into a...
Posted by fmcquillan99 <gi...@git.apache.org>.
Github user fmcquillan99 commented on the issue:
https://github.com/apache/madlib/pull/288
Since we are writing out a summary table, may as well add more info in it.
{code}
A summary table named <out_table>_summary is also created at the same time, which has the following columns:
source_table TEXT. Source table name.
feature_names TEXT[]. Array of names of features.
total_rows_processed INTEGER. Total numbers of rows processed.
total_rows_skipped INTEGER. Total numbers of rows skipped due to failures.
{code}
Is this do-able @hpandeycodeit ?
---
[GitHub] madlib issue #288: Jira:1239: Converts features from multiple columns into a...
Posted by fmcquillan99 <gi...@git.apache.org>.
Github user fmcquillan99 commented on the issue:
https://github.com/apache/madlib/pull/288
update my comment above to remove the rows processed and skipped.
---
[GitHub] madlib pull request #288: Jira:1239: Converts features from multiple columns...
Posted by fmcquillan99 <gi...@git.apache.org>.
Github user fmcquillan99 commented on a diff in the pull request:
https://github.com/apache/madlib/pull/288#discussion_r200510366
--- Diff: src/ports/postgres/modules/cols_vec/cols2vec.py_in ---
@@ -0,0 +1,110 @@
+"""
+@file cols2vec.py_in
+
+@brief Utility to convert Columns to array
+
+"""
+
+import plpy
+from utilities.control import MinWarning
+from utilities.utilities import split_quoted_delimited_str
+from utilities.utilities import _string_to_array
+from utilities.validate_args import columns_exist_in_table
+from utilities.validate_args import is_var_valid
+from utilities.validate_args import get_cols
+from utilities.validate_args import quote_ident
+from utilities.utilities import py_list_to_sql_string
+
+
+m4_changequote(`<!', `!>')
+
+
+def validate_cols2vec_args(source_table, output_table,
+ list_of_features, list_of_features_to_exclude, cols_to_output, **kwargs):
+ """
+ Function to validate input parameters
+ """
+ if list_of_features.strip() != '*':
+ if not (list_of_features and list_of_features.strip()):
+ plpy.error("Features to include is empty")
+
+ if list_of_features.strip() != '*':
+ if not columns_exist_in_table(
+ source_table, split_quoted_delimited_str(list_of_features)):
+ plpy.error(
+ "Invalid columns to list_of_features ({0})".format(list_of_features))
+
+ if cols_to_output and cols_to_output.strip() != '*':
+ if not columns_exist_in_table(
+ source_table, _string_to_array(cols_to_output)):
+ plpy.error("Invalid columns to output list ({0})".
+ format(cols_to_output))
+
+
+def cols2vec(schema_madlib, source_table, output_table, list_of_features,
+ list_of_features_to_exclude=None, cols_to_output=None, **kwargs):
+ """
+ Args:
+ @param schema_madlib: Name of MADlib schema
+ @param model: Name of table containing the tree model
+ @param source_table: Name of table containing prediction data
+ @param output_table: Name of table to output the results
+ @param list_of_features: Comma-separated string of column names or
+ expressions to put into feature array.
+ Can also be a '*' implying all columns
+ are to be put into feature array.
+ @param list_of_features_to_exclude: Comma-separated string of column names
+ to exclude from the feature array
+ @param cols_to_output: Comma-separated string of column names
+ from the source table to keep in the output table,
+ in addition to the feature array.
+
+ Returns:
+ None
+
+ """
+
+ with MinWarning('warning'):
+ validate_cols2vec_args(source_table, output_table, list_of_features,
+ list_of_features_to_exclude, cols_to_output, **kwargs)
+
+ all_cols = ''
+ feature_cols = ''
+ if list_of_features.strip() == '*':
+ all_cols = get_cols(source_table, schema_madlib)
+ all_col_set = set(list(all_cols))
+ exclude_set = set(split_quoted_delimited_str(
+ list_of_features_to_exclude))
+ feature_set = all_col_set - exclude_set
+ feature_cols = py_list_to_sql_string(
+ list(feature_set), "text", False)
+ filtered_list_of_features = ",".join(
+ feat for feat in list(feature_set))
+ else:
+ feature_list = split_quoted_delimited_str(list_of_features)
+ feature_exclude = split_quoted_delimited_str(
+ list_of_features_to_exclude)
+ return_set = set(feature_list) - set(feature_exclude)
+ feature_cols = py_list_to_sql_string(
+ list(return_set), "text", False)
+ filtered_list_of_features = ",".join(
+ feat for feat in feature_list if feat in return_set)
+
+ output_cols = ''
+ if cols_to_output:
+ output_cols_list = [', '.join(get_cols(source_table, schema_madlib)) if col == '*' else col
+ for col in split_quoted_delimited_str(cols_to_output)]
+ output_cols = ', '.join(output_cols_list) + ","
+
+ plpy.execute("""
+ CREATE TABLE {output_table} AS
+ select {output_cols}
+ array[{filtered_list_of_features}] as feature_vector
+ from {source_table}
+ """.format(**locals()))
+
+ plpy.execute("""
--- End diff --
If there are 1000+ columns which you want to keep track of, then saving the array or col names in a summary tables might be convenient. It is not ideal but should not be onerous to the user and they can ignore the summary table if they don't care about it.
---
[GitHub] madlib pull request #288: Jira:1239: Converts features from multiple columns...
Posted by njayaram2 <gi...@git.apache.org>.
Github user njayaram2 commented on a diff in the pull request:
https://github.com/apache/madlib/pull/288#discussion_r200444598
--- Diff: src/ports/postgres/modules/cols_vec/cols2vec.py_in ---
@@ -0,0 +1,110 @@
+"""
+@file cols2vec.py_in
+
+@brief Utility to convert Columns to array
+
+"""
+
+import plpy
+from utilities.control import MinWarning
+from utilities.utilities import split_quoted_delimited_str
+from utilities.utilities import _string_to_array
+from utilities.validate_args import columns_exist_in_table
+from utilities.validate_args import is_var_valid
+from utilities.validate_args import get_cols
+from utilities.validate_args import quote_ident
+from utilities.utilities import py_list_to_sql_string
+
+
+m4_changequote(`<!', `!>')
+
+
+def validate_cols2vec_args(source_table, output_table,
+ list_of_features, list_of_features_to_exclude, cols_to_output, **kwargs):
+ """
+ Function to validate input parameters
+ """
+ if list_of_features.strip() != '*':
+ if not (list_of_features and list_of_features.strip()):
+ plpy.error("Features to include is empty")
+
+ if list_of_features.strip() != '*':
+ if not columns_exist_in_table(
+ source_table, split_quoted_delimited_str(list_of_features)):
+ plpy.error(
+ "Invalid columns to list_of_features ({0})".format(list_of_features))
+
+ if cols_to_output and cols_to_output.strip() != '*':
+ if not columns_exist_in_table(
+ source_table, _string_to_array(cols_to_output)):
+ plpy.error("Invalid columns to output list ({0})".
+ format(cols_to_output))
+
+
+def cols2vec(schema_madlib, source_table, output_table, list_of_features,
+ list_of_features_to_exclude=None, cols_to_output=None, **kwargs):
+ """
+ Args:
+ @param schema_madlib: Name of MADlib schema
+ @param model: Name of table containing the tree model
+ @param source_table: Name of table containing prediction data
+ @param output_table: Name of table to output the results
+ @param list_of_features: Comma-separated string of column names or
+ expressions to put into feature array.
+ Can also be a '*' implying all columns
+ are to be put into feature array.
+ @param list_of_features_to_exclude: Comma-separated string of column names
+ to exclude from the feature array
+ @param cols_to_output: Comma-separated string of column names
+ from the source table to keep in the output table,
+ in addition to the feature array.
+
+ Returns:
+ None
+
+ """
+
+ with MinWarning('warning'):
+ validate_cols2vec_args(source_table, output_table, list_of_features,
+ list_of_features_to_exclude, cols_to_output, **kwargs)
+
+ all_cols = ''
+ feature_cols = ''
+ if list_of_features.strip() == '*':
+ all_cols = get_cols(source_table, schema_madlib)
+ all_col_set = set(list(all_cols))
+ exclude_set = set(split_quoted_delimited_str(
+ list_of_features_to_exclude))
+ feature_set = all_col_set - exclude_set
+ feature_cols = py_list_to_sql_string(
+ list(feature_set), "text", False)
+ filtered_list_of_features = ",".join(
+ feat for feat in list(feature_set))
+ else:
+ feature_list = split_quoted_delimited_str(list_of_features)
+ feature_exclude = split_quoted_delimited_str(
+ list_of_features_to_exclude)
+ return_set = set(feature_list) - set(feature_exclude)
+ feature_cols = py_list_to_sql_string(
+ list(return_set), "text", False)
+ filtered_list_of_features = ",".join(
+ feat for feat in feature_list if feat in return_set)
+
+ output_cols = ''
+ if cols_to_output:
+ output_cols_list = [', '.join(get_cols(source_table, schema_madlib)) if col == '*' else col
+ for col in split_quoted_delimited_str(cols_to_output)]
+ output_cols = ', '.join(output_cols_list) + ","
+
+ plpy.execute("""
+ CREATE TABLE {output_table} AS
+ select {output_cols}
+ array[{filtered_list_of_features}] as feature_vector
+ from {source_table}
+ """.format(**locals()))
+
+ plpy.execute("""
--- End diff --
@iyerr3 The vec2cols story (https://issues.apache.org/jira/browse/MADLIB-1240) might consume this summary table if provided as the `dictionary` param in that module.
---
[GitHub] madlib pull request #288: Jira:1239: Converts features from multiple columns...
Posted by hpandeycodeit <gi...@git.apache.org>.
Github user hpandeycodeit commented on a diff in the pull request:
https://github.com/apache/madlib/pull/288#discussion_r200788091
--- Diff: src/ports/postgres/modules/cols_vec/cols2vec.py_in ---
@@ -0,0 +1,110 @@
+"""
+@file cols2vec.py_in
+
+@brief Utility to convert Columns to array
+
+"""
+
+import plpy
+from utilities.control import MinWarning
+from utilities.utilities import split_quoted_delimited_str
+from utilities.utilities import _string_to_array
+from utilities.validate_args import columns_exist_in_table
+from utilities.validate_args import is_var_valid
+from utilities.validate_args import get_cols
+from utilities.validate_args import quote_ident
+from utilities.utilities import py_list_to_sql_string
+
+
+m4_changequote(`<!', `!>')
+
+
+def validate_cols2vec_args(source_table, output_table,
+ list_of_features, list_of_features_to_exclude, cols_to_output, **kwargs):
+ """
+ Function to validate input parameters
+ """
+ if list_of_features.strip() != '*':
+ if not (list_of_features and list_of_features.strip()):
+ plpy.error("Features to include is empty")
+
+ if list_of_features.strip() != '*':
+ if not columns_exist_in_table(
+ source_table, split_quoted_delimited_str(list_of_features)):
+ plpy.error(
+ "Invalid columns to list_of_features ({0})".format(list_of_features))
+
+ if cols_to_output and cols_to_output.strip() != '*':
+ if not columns_exist_in_table(
+ source_table, _string_to_array(cols_to_output)):
+ plpy.error("Invalid columns to output list ({0})".
+ format(cols_to_output))
+
+
+def cols2vec(schema_madlib, source_table, output_table, list_of_features,
+ list_of_features_to_exclude=None, cols_to_output=None, **kwargs):
+ """
+ Args:
+ @param schema_madlib: Name of MADlib schema
+ @param model: Name of table containing the tree model
+ @param source_table: Name of table containing prediction data
+ @param output_table: Name of table to output the results
+ @param list_of_features: Comma-separated string of column names or
+ expressions to put into feature array.
+ Can also be a '*' implying all columns
+ are to be put into feature array.
+ @param list_of_features_to_exclude: Comma-separated string of column names
+ to exclude from the feature array
+ @param cols_to_output: Comma-separated string of column names
+ from the source table to keep in the output table,
+ in addition to the feature array.
+
+ Returns:
+ None
+
+ """
+
+ with MinWarning('warning'):
+ validate_cols2vec_args(source_table, output_table, list_of_features,
+ list_of_features_to_exclude, cols_to_output, **kwargs)
+
+ all_cols = ''
+ feature_cols = ''
+ if list_of_features.strip() == '*':
+ all_cols = get_cols(source_table, schema_madlib)
+ all_col_set = set(list(all_cols))
+ exclude_set = set(split_quoted_delimited_str(
+ list_of_features_to_exclude))
+ feature_set = all_col_set - exclude_set
+ feature_cols = py_list_to_sql_string(
+ list(feature_set), "text", False)
+ filtered_list_of_features = ",".join(
+ feat for feat in list(feature_set))
+ else:
+ feature_list = split_quoted_delimited_str(list_of_features)
+ feature_exclude = split_quoted_delimited_str(
+ list_of_features_to_exclude)
+ return_set = set(feature_list) - set(feature_exclude)
--- End diff --
Updated the above code as well along Also the order of features will remain same.
---
[GitHub] madlib pull request #288: Jira:1239: Converts features from multiple columns...
Posted by iyerr3 <gi...@git.apache.org>.
Github user iyerr3 commented on a diff in the pull request:
https://github.com/apache/madlib/pull/288#discussion_r199981733
--- Diff: src/config/Modules.yml ---
@@ -50,3 +50,4 @@ modules:
- name: validation
depends: ['array_ops', 'regress']
- name: stemmer
+ - name: cols_vec
--- End diff --
I'm not convinced that we need a new module for this functionality. IMO this is better suited for the `utilities` module as a separate file.
---
[GitHub] madlib issue #288: Jira:1239: Converts features from multiple columns into a...
Posted by hpandeycodeit <gi...@git.apache.org>.
Github user hpandeycodeit commented on the issue:
https://github.com/apache/madlib/pull/288
@fmcquillan99 ,
What is total_rows_processed and total_rows_skipped ? Can you provide more details on these?
---
[GitHub] madlib pull request #288: Jira:1239: Converts features from multiple columns...
Posted by hpandeycodeit <gi...@git.apache.org>.
Github user hpandeycodeit commented on a diff in the pull request:
https://github.com/apache/madlib/pull/288#discussion_r200787816
--- Diff: src/ports/postgres/modules/cols_vec/cols2vec.py_in ---
@@ -0,0 +1,110 @@
+"""
+@file cols2vec.py_in
+
+@brief Utility to convert Columns to array
+
+"""
+
+import plpy
+from utilities.control import MinWarning
+from utilities.utilities import split_quoted_delimited_str
+from utilities.utilities import _string_to_array
+from utilities.validate_args import columns_exist_in_table
+from utilities.validate_args import is_var_valid
+from utilities.validate_args import get_cols
+from utilities.validate_args import quote_ident
+from utilities.utilities import py_list_to_sql_string
+
+
+m4_changequote(`<!', `!>')
+
+
+def validate_cols2vec_args(source_table, output_table,
+ list_of_features, list_of_features_to_exclude, cols_to_output, **kwargs):
+ """
+ Function to validate input parameters
+ """
+ if list_of_features.strip() != '*':
--- End diff --
This is done.
---
[GitHub] madlib pull request #288: Jira:1239: Converts features from multiple columns...
Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:
https://github.com/apache/madlib/pull/288
---