You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@madlib.apache.org by hpandeycodeit <gi...@git.apache.org> on 2018/07/03 22:22:07 UTC

[GitHub] madlib pull request #288: Madlib 1239

GitHub user hpandeycodeit opened a pull request:

    https://github.com/apache/madlib/pull/288

    Madlib 1239

    JIRA: 1239
    
    Added a new module cols_vec which Converts features from multiple columns of an input table into a feature array in a single column.
    
    Following files are committed:
    
    cols2vec.py_in
    cols2vec.sql_in
    test/cols2vec.sql_in
    Modules.yml
    
    For special characters handling, using the py_list_to_string with "long_format = False".
     Also, split_quoted_delimited_str which quotes each element of the array. 
    Tests with special characters are added in the install check. 
    


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/hpandeycodeit/incubator-madlib MADLIB_1239

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/madlib/pull/288.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #288
    
----
commit 3b5a7780dee45edb215351879c892945d2c59c4c
Author: hpandeycodeit <hp...@...>
Date:   2018-06-15T08:33:27Z

        Changes for Jira: 1239, Converts features from multiple columns into a feature array

commit 06bef61bffe71bae5bb9956c650d39ae7e0d5ef8
Author: hpandeycodeit <hp...@...>
Date:   2018-06-15T08:41:10Z

    added Module cols_vec

commit 40f08e1f17e53fc8b3fddb873f31f35be0aeb7f6
Author: hpandeycodeit <hp...@...>
Date:   2018-06-15T08:43:23Z

    added Module cols_vec

commit 3050607311059acde59d8a309eb10bb745a31469
Author: hpandeycodeit <hp...@...>
Date:   2018-06-18T22:13:31Z

      Fixed the test file for install-check failure.

commit 7dce06b3e7380ac8a97021943246450568b6e242
Author: hpandeycodeit <hp...@...>
Date:   2018-07-03T22:16:36Z

     Changes for Jira: 1239, Converts features from multiple columns into a feature array

----


---

[GitHub] madlib issue #288: Jira:1239: Converts features from multiple columns into a...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit commented on the issue:

    https://github.com/apache/madlib/pull/288
  
    
    Refer to this link for build results (access rights to CI server needed): 
    https://builds.apache.org/job/madlib-pr-build/536/



---

[GitHub] madlib pull request #288: Jira:1239: Converts features from multiple columns...

Posted by hpandeycodeit <gi...@git.apache.org>.
Github user hpandeycodeit commented on a diff in the pull request:

    https://github.com/apache/madlib/pull/288#discussion_r200891497
  
    --- Diff: src/ports/postgres/modules/cols_vec/cols2vec.py_in ---
    @@ -0,0 +1,104 @@
    +"""
    +@file cols2vec.py_in
    +
    +@brief Utility to convert Columns to array
    +
    +"""
    +
    +import plpy
    +from utilities.control import MinWarning
    +from utilities.utilities import split_quoted_delimited_str
    +from utilities.utilities import _string_to_array
    +from utilities.utilities import _assert
    +from utilities.validate_args import columns_exist_in_table
    +from utilities.validate_args import is_var_valid
    +from utilities.validate_args import get_cols
    +from utilities.validate_args import quote_ident
    +from utilities.utilities import py_list_to_sql_string
    +
    +
    +m4_changequote(`<!', `!>')
    +
    +
    +def validate_cols2vec_args(source_table, output_table,
    +                           list_of_features, list_of_features_to_exclude, cols_to_output, **kwargs):
    +    """
    +        Function to validate input parameters
    +    """
    +    if list_of_features.strip() != '*':
    +        if not (list_of_features and list_of_features.strip()):
    +            plpy.error("Features to include is empty")
    +        _assert(
    +            columns_exist_in_table(
    +                source_table, split_quoted_delimited_str(list_of_features)),
    +            "Invalid columns to list of features {0}".format(list_of_features))
    +
    +    if cols_to_output and cols_to_output.strip() != '*':
    +        _assert(
    +            columns_exist_in_table(
    +                source_table, _string_to_array(cols_to_output)),
    +            "Invalid columns to output list {0}".format(cols_to_output))
    +
    +
    +def cols2vec(schema_madlib, source_table, output_table, list_of_features,
    +             list_of_features_to_exclude=None, cols_to_output=None, **kwargs):
    +    """
    +    Args:
    +        @param schema_madlib:               Name of MADlib schema
    +        @param model:                       Name of table containing the tree model
    +        @param source_table:                Name of table containing prediction data
    +        @param output_table:                Name of table to output the results
    +        @param list_of_features:            Comma-separated string of column names or
    +                                            expressions to put into feature array.
    +                                            Can also be a '*' implying all columns
    +                                            are to be put into feature array.
    +        @param list_of_features_to_exclude: Comma-separated string of column names
    +                                            to exclude from the feature array
    +        @param cols_to_output:              Comma-separated string of column names
    +                                            from the source table to keep in the output table,
    +                                            in addition to the feature array.
    +
    +    Returns:
    +        None
    +
    +    """
    +
    +    with MinWarning('warning'):
    +        validate_cols2vec_args(source_table, output_table, list_of_features,
    +                               list_of_features_to_exclude, cols_to_output, **kwargs)
    +
    +        all_cols = ''
    +        feature_cols = ''
    +        feature_list = ''
    +        if list_of_features.strip() == '*':
    +            all_cols = get_cols(source_table, schema_madlib)
    +            all_col_set = set(list(all_cols))
    +            exclude_set = set(split_quoted_delimited_str(
    +                list_of_features_to_exclude))
    +            feature_list = list(all_col_set - exclude_set)
    +        else:
    +            feature_list = split_quoted_delimited_str(list_of_features)
    +
    +        feature_cols = py_list_to_sql_string(
    +            list(feature_list), "text", False)
    +        filtered_list_of_features = ",".join(
    --- End diff --
    
    Above changes are done as suggested. 


---

[GitHub] madlib issue #288: Jira:1239: Converts features from multiple columns into a...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit commented on the issue:

    https://github.com/apache/madlib/pull/288
  
    
    Refer to this link for build results (access rights to CI server needed): 
    https://builds.apache.org/job/madlib-pr-build/535/



---

[GitHub] madlib pull request #288: Jira:1239: Converts features from multiple columns...

Posted by iyerr3 <gi...@git.apache.org>.
Github user iyerr3 commented on a diff in the pull request:

    https://github.com/apache/madlib/pull/288#discussion_r199982717
  
    --- Diff: src/ports/postgres/modules/cols_vec/cols2vec.py_in ---
    @@ -0,0 +1,110 @@
    +"""
    +@file cols2vec.py_in
    +
    +@brief Utility to convert Columns to array
    +
    +"""
    +
    +import plpy
    +from utilities.control import MinWarning
    +from utilities.utilities import split_quoted_delimited_str
    +from utilities.utilities import _string_to_array
    +from utilities.validate_args import columns_exist_in_table
    +from utilities.validate_args import is_var_valid
    +from utilities.validate_args import get_cols
    +from utilities.validate_args import quote_ident
    +from utilities.utilities import py_list_to_sql_string
    +
    +
    +m4_changequote(`<!', `!>')
    +
    +
    +def validate_cols2vec_args(source_table, output_table,
    +                           list_of_features, list_of_features_to_exclude, cols_to_output, **kwargs):
    +    """
    +        Function to validate input parameters
    +    """
    +    if list_of_features.strip() != '*':
    +        if not (list_of_features and list_of_features.strip()):
    +            plpy.error("Features to include is empty")
    +
    +    if list_of_features.strip() != '*':
    +        if not columns_exist_in_table(
    +                source_table, split_quoted_delimited_str(list_of_features)):
    +            plpy.error(
    +                "Invalid columns to list_of_features ({0})".format(list_of_features))
    +
    +    if cols_to_output and cols_to_output.strip() != '*':
    +        if not columns_exist_in_table(
    +                source_table, _string_to_array(cols_to_output)):
    +            plpy.error("Invalid columns to output list ({0})".
    +                       format(cols_to_output))
    +
    +
    +def cols2vec(schema_madlib, source_table, output_table, list_of_features,
    +             list_of_features_to_exclude=None, cols_to_output=None, **kwargs):
    +    """
    +    Args:
    +        @param schema_madlib:               Name of MADlib schema
    +        @param model:                       Name of table containing the tree model
    +        @param source_table:                Name of table containing prediction data
    +        @param output_table:                Name of table to output the results
    +        @param list_of_features:            Comma-separated string of column names or
    +                                            expressions to put into feature array.
    +                                            Can also be a '*' implying all columns
    +                                            are to be put into feature array.
    +        @param list_of_features_to_exclude: Comma-separated string of column names
    +                                            to exclude from the feature array
    +        @param cols_to_output:              Comma-separated string of column names
    +                                            from the source table to keep in the output table,
    +                                            in addition to the feature array.
    +
    +    Returns:
    +        None
    +
    +    """
    +
    +    with MinWarning('warning'):
    +        validate_cols2vec_args(source_table, output_table, list_of_features,
    +                               list_of_features_to_exclude, cols_to_output, **kwargs)
    +
    +        all_cols = ''
    +        feature_cols = ''
    +        if list_of_features.strip() == '*':
    --- End diff --
    
    The `if` and `else` blocks seem to be very similar except for the source of the `all_cols/feature_list`. I suggest using the if switch only for populating the source columns. Other statements can be moved out of the if. 


---

[GitHub] madlib pull request #288: Jira:1239: Converts features from multiple columns...

Posted by iyerr3 <gi...@git.apache.org>.
Github user iyerr3 commented on a diff in the pull request:

    https://github.com/apache/madlib/pull/288#discussion_r200790240
  
    --- Diff: src/ports/postgres/modules/cols_vec/cols2vec.py_in ---
    @@ -0,0 +1,104 @@
    +"""
    +@file cols2vec.py_in
    +
    +@brief Utility to convert Columns to array
    +
    +"""
    +
    +import plpy
    +from utilities.control import MinWarning
    +from utilities.utilities import split_quoted_delimited_str
    +from utilities.utilities import _string_to_array
    +from utilities.utilities import _assert
    +from utilities.validate_args import columns_exist_in_table
    +from utilities.validate_args import is_var_valid
    +from utilities.validate_args import get_cols
    +from utilities.validate_args import quote_ident
    +from utilities.utilities import py_list_to_sql_string
    +
    +
    +m4_changequote(`<!', `!>')
    +
    +
    +def validate_cols2vec_args(source_table, output_table,
    +                           list_of_features, list_of_features_to_exclude, cols_to_output, **kwargs):
    +    """
    +        Function to validate input parameters
    +    """
    +    if list_of_features.strip() != '*':
    +        if not (list_of_features and list_of_features.strip()):
    +            plpy.error("Features to include is empty")
    +        _assert(
    +            columns_exist_in_table(
    +                source_table, split_quoted_delimited_str(list_of_features)),
    +            "Invalid columns to list of features {0}".format(list_of_features))
    +
    +    if cols_to_output and cols_to_output.strip() != '*':
    +        _assert(
    +            columns_exist_in_table(
    +                source_table, _string_to_array(cols_to_output)),
    +            "Invalid columns to output list {0}".format(cols_to_output))
    +
    +
    +def cols2vec(schema_madlib, source_table, output_table, list_of_features,
    +             list_of_features_to_exclude=None, cols_to_output=None, **kwargs):
    +    """
    +    Args:
    +        @param schema_madlib:               Name of MADlib schema
    +        @param model:                       Name of table containing the tree model
    +        @param source_table:                Name of table containing prediction data
    +        @param output_table:                Name of table to output the results
    +        @param list_of_features:            Comma-separated string of column names or
    +                                            expressions to put into feature array.
    +                                            Can also be a '*' implying all columns
    +                                            are to be put into feature array.
    +        @param list_of_features_to_exclude: Comma-separated string of column names
    +                                            to exclude from the feature array
    +        @param cols_to_output:              Comma-separated string of column names
    +                                            from the source table to keep in the output table,
    +                                            in addition to the feature array.
    +
    +    Returns:
    +        None
    +
    +    """
    +
    +    with MinWarning('warning'):
    +        validate_cols2vec_args(source_table, output_table, list_of_features,
    +                               list_of_features_to_exclude, cols_to_output, **kwargs)
    +
    +        all_cols = ''
    --- End diff --
    
    We can delete lines 70, 71, 72 since we don't need those anymore. 


---

[GitHub] madlib issue #288: Jira:1239: Converts features from multiple columns into a...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit commented on the issue:

    https://github.com/apache/madlib/pull/288
  
    
    Refer to this link for build results (access rights to CI server needed): 
    https://builds.apache.org/job/madlib-pr-build/537/



---

[GitHub] madlib pull request #288: Jira:1239: Converts features from multiple columns...

Posted by iyerr3 <gi...@git.apache.org>.
Github user iyerr3 commented on a diff in the pull request:

    https://github.com/apache/madlib/pull/288#discussion_r199981970
  
    --- Diff: src/ports/postgres/modules/cols_vec/cols2vec.py_in ---
    @@ -0,0 +1,110 @@
    +"""
    +@file cols2vec.py_in
    +
    +@brief Utility to convert Columns to array
    +
    +"""
    +
    +import plpy
    +from utilities.control import MinWarning
    +from utilities.utilities import split_quoted_delimited_str
    +from utilities.utilities import _string_to_array
    +from utilities.validate_args import columns_exist_in_table
    +from utilities.validate_args import is_var_valid
    +from utilities.validate_args import get_cols
    +from utilities.validate_args import quote_ident
    +from utilities.utilities import py_list_to_sql_string
    +
    +
    +m4_changequote(`<!', `!>')
    +
    +
    +def validate_cols2vec_args(source_table, output_table,
    +                           list_of_features, list_of_features_to_exclude, cols_to_output, **kwargs):
    +    """
    +        Function to validate input parameters
    +    """
    +    if list_of_features.strip() != '*':
    --- End diff --
    
    The checks below are better expressed as assert statements using `_assert(...)`


---

[GitHub] madlib pull request #288: Jira:1239: Converts features from multiple columns...

Posted by hpandeycodeit <gi...@git.apache.org>.
Github user hpandeycodeit commented on a diff in the pull request:

    https://github.com/apache/madlib/pull/288#discussion_r200787876
  
    --- Diff: src/config/Modules.yml ---
    @@ -50,3 +50,4 @@ modules:
         - name: validation
           depends: ['array_ops', 'regress']
         - name: stemmer
    +    - name: cols_vec
    --- End diff --
    
    Discussed it with @fmcquillan99 and I will move it under Utilities module in next commit. 


---

[GitHub] madlib pull request #288: Jira:1239: Converts features from multiple columns...

Posted by iyerr3 <gi...@git.apache.org>.
Github user iyerr3 commented on a diff in the pull request:

    https://github.com/apache/madlib/pull/288#discussion_r199983644
  
    --- Diff: src/ports/postgres/modules/cols_vec/cols2vec.py_in ---
    @@ -0,0 +1,110 @@
    +"""
    +@file cols2vec.py_in
    +
    +@brief Utility to convert Columns to array
    +
    +"""
    +
    +import plpy
    +from utilities.control import MinWarning
    +from utilities.utilities import split_quoted_delimited_str
    +from utilities.utilities import _string_to_array
    +from utilities.validate_args import columns_exist_in_table
    +from utilities.validate_args import is_var_valid
    +from utilities.validate_args import get_cols
    +from utilities.validate_args import quote_ident
    +from utilities.utilities import py_list_to_sql_string
    +
    +
    +m4_changequote(`<!', `!>')
    +
    +
    +def validate_cols2vec_args(source_table, output_table,
    +                           list_of_features, list_of_features_to_exclude, cols_to_output, **kwargs):
    +    """
    +        Function to validate input parameters
    +    """
    +    if list_of_features.strip() != '*':
    +        if not (list_of_features and list_of_features.strip()):
    +            plpy.error("Features to include is empty")
    +
    +    if list_of_features.strip() != '*':
    +        if not columns_exist_in_table(
    +                source_table, split_quoted_delimited_str(list_of_features)):
    +            plpy.error(
    +                "Invalid columns to list_of_features ({0})".format(list_of_features))
    +
    +    if cols_to_output and cols_to_output.strip() != '*':
    +        if not columns_exist_in_table(
    +                source_table, _string_to_array(cols_to_output)):
    +            plpy.error("Invalid columns to output list ({0})".
    +                       format(cols_to_output))
    +
    +
    +def cols2vec(schema_madlib, source_table, output_table, list_of_features,
    +             list_of_features_to_exclude=None, cols_to_output=None, **kwargs):
    +    """
    +    Args:
    +        @param schema_madlib:               Name of MADlib schema
    +        @param model:                       Name of table containing the tree model
    +        @param source_table:                Name of table containing prediction data
    +        @param output_table:                Name of table to output the results
    +        @param list_of_features:            Comma-separated string of column names or
    +                                            expressions to put into feature array.
    +                                            Can also be a '*' implying all columns
    +                                            are to be put into feature array.
    +        @param list_of_features_to_exclude: Comma-separated string of column names
    +                                            to exclude from the feature array
    +        @param cols_to_output:              Comma-separated string of column names
    +                                            from the source table to keep in the output table,
    +                                            in addition to the feature array.
    +
    +    Returns:
    +        None
    +
    +    """
    +
    +    with MinWarning('warning'):
    +        validate_cols2vec_args(source_table, output_table, list_of_features,
    +                               list_of_features_to_exclude, cols_to_output, **kwargs)
    +
    +        all_cols = ''
    +        feature_cols = ''
    +        if list_of_features.strip() == '*':
    +            all_cols = get_cols(source_table, schema_madlib)
    +            all_col_set = set(list(all_cols))
    +            exclude_set = set(split_quoted_delimited_str(
    +                list_of_features_to_exclude))
    +            feature_set = all_col_set - exclude_set
    +            feature_cols = py_list_to_sql_string(
    +                list(feature_set), "text", False)
    +            filtered_list_of_features = ",".join(
    +                feat for feat in list(feature_set))
    +        else:
    +            feature_list = split_quoted_delimited_str(list_of_features)
    +            feature_exclude = split_quoted_delimited_str(
    +                list_of_features_to_exclude)
    +            return_set = set(feature_list) - set(feature_exclude)
    +            feature_cols = py_list_to_sql_string(
    +                list(return_set), "text", False)
    +            filtered_list_of_features = ",".join(
    +                feat for feat in feature_list if feat in return_set)
    +
    +        output_cols = ''
    +        if cols_to_output:
    +            output_cols_list = [', '.join(get_cols(source_table, schema_madlib)) if col == '*' else col
    +                                for col in split_quoted_delimited_str(cols_to_output)]
    +            output_cols = ', '.join(output_cols_list) + ","
    +
    +        plpy.execute("""
    +    		CREATE TABLE {output_table} AS
    +    		select {output_cols}
    +            array[{filtered_list_of_features}] as feature_vector
    +    		from {source_table}
    +    		""".format(**locals()))
    +
    +        plpy.execute("""
    --- End diff --
    
    I haven't understood the need for the summary table. Is it just to record the features combined in the array? Can we provide that as an output of the function? Creating a table just to record that one parameter seems unnecessary. 


---

[GitHub] madlib pull request #288: Jira:1239: Converts features from multiple columns...

Posted by iyerr3 <gi...@git.apache.org>.
Github user iyerr3 commented on a diff in the pull request:

    https://github.com/apache/madlib/pull/288#discussion_r200790286
  
    --- Diff: src/ports/postgres/modules/cols_vec/cols2vec.py_in ---
    @@ -0,0 +1,104 @@
    +"""
    +@file cols2vec.py_in
    +
    +@brief Utility to convert Columns to array
    +
    +"""
    +
    +import plpy
    +from utilities.control import MinWarning
    +from utilities.utilities import split_quoted_delimited_str
    +from utilities.utilities import _string_to_array
    +from utilities.utilities import _assert
    +from utilities.validate_args import columns_exist_in_table
    +from utilities.validate_args import is_var_valid
    +from utilities.validate_args import get_cols
    +from utilities.validate_args import quote_ident
    +from utilities.utilities import py_list_to_sql_string
    +
    +
    +m4_changequote(`<!', `!>')
    +
    +
    +def validate_cols2vec_args(source_table, output_table,
    +                           list_of_features, list_of_features_to_exclude, cols_to_output, **kwargs):
    +    """
    +        Function to validate input parameters
    +    """
    +    if list_of_features.strip() != '*':
    +        if not (list_of_features and list_of_features.strip()):
    +            plpy.error("Features to include is empty")
    +        _assert(
    +            columns_exist_in_table(
    +                source_table, split_quoted_delimited_str(list_of_features)),
    +            "Invalid columns to list of features {0}".format(list_of_features))
    +
    +    if cols_to_output and cols_to_output.strip() != '*':
    +        _assert(
    +            columns_exist_in_table(
    +                source_table, _string_to_array(cols_to_output)),
    +            "Invalid columns to output list {0}".format(cols_to_output))
    +
    +
    +def cols2vec(schema_madlib, source_table, output_table, list_of_features,
    +             list_of_features_to_exclude=None, cols_to_output=None, **kwargs):
    +    """
    +    Args:
    +        @param schema_madlib:               Name of MADlib schema
    +        @param model:                       Name of table containing the tree model
    +        @param source_table:                Name of table containing prediction data
    +        @param output_table:                Name of table to output the results
    +        @param list_of_features:            Comma-separated string of column names or
    +                                            expressions to put into feature array.
    +                                            Can also be a '*' implying all columns
    +                                            are to be put into feature array.
    +        @param list_of_features_to_exclude: Comma-separated string of column names
    +                                            to exclude from the feature array
    +        @param cols_to_output:              Comma-separated string of column names
    +                                            from the source table to keep in the output table,
    +                                            in addition to the feature array.
    +
    +    Returns:
    +        None
    +
    +    """
    +
    +    with MinWarning('warning'):
    +        validate_cols2vec_args(source_table, output_table, list_of_features,
    +                               list_of_features_to_exclude, cols_to_output, **kwargs)
    +
    +        all_cols = ''
    +        feature_cols = ''
    +        feature_list = ''
    +        if list_of_features.strip() == '*':
    +            all_cols = get_cols(source_table, schema_madlib)
    +            all_col_set = set(list(all_cols))
    +            exclude_set = set(split_quoted_delimited_str(
    +                list_of_features_to_exclude))
    +            feature_list = list(all_col_set - exclude_set)
    +        else:
    +            feature_list = split_quoted_delimited_str(list_of_features)
    +
    +        feature_cols = py_list_to_sql_string(
    +            list(feature_list), "text", False)
    +        filtered_list_of_features = ",".join(
    --- End diff --
    
    `filtered_list_of_features = ",".join(feature_list)`


---

[GitHub] madlib pull request #288: Jira:1239: Converts features from multiple columns...

Posted by iyerr3 <gi...@git.apache.org>.
Github user iyerr3 commented on a diff in the pull request:

    https://github.com/apache/madlib/pull/288#discussion_r199982049
  
    --- Diff: src/ports/postgres/modules/cols_vec/cols2vec.py_in ---
    @@ -0,0 +1,110 @@
    +"""
    +@file cols2vec.py_in
    +
    +@brief Utility to convert Columns to array
    +
    +"""
    +
    +import plpy
    +from utilities.control import MinWarning
    +from utilities.utilities import split_quoted_delimited_str
    +from utilities.utilities import _string_to_array
    +from utilities.validate_args import columns_exist_in_table
    +from utilities.validate_args import is_var_valid
    +from utilities.validate_args import get_cols
    +from utilities.validate_args import quote_ident
    +from utilities.utilities import py_list_to_sql_string
    +
    +
    +m4_changequote(`<!', `!>')
    +
    +
    +def validate_cols2vec_args(source_table, output_table,
    +                           list_of_features, list_of_features_to_exclude, cols_to_output, **kwargs):
    +    """
    +        Function to validate input parameters
    +    """
    +    if list_of_features.strip() != '*':
    +        if not (list_of_features and list_of_features.strip()):
    +            plpy.error("Features to include is empty")
    +
    +    if list_of_features.strip() != '*':
    --- End diff --
    
    Please combine this with the above if statement


---

[GitHub] madlib pull request #288: Jira:1239: Converts features from multiple columns...

Posted by iyerr3 <gi...@git.apache.org>.
Github user iyerr3 commented on a diff in the pull request:

    https://github.com/apache/madlib/pull/288#discussion_r199983404
  
    --- Diff: src/ports/postgres/modules/cols_vec/cols2vec.py_in ---
    @@ -0,0 +1,110 @@
    +"""
    +@file cols2vec.py_in
    +
    +@brief Utility to convert Columns to array
    +
    +"""
    +
    +import plpy
    +from utilities.control import MinWarning
    +from utilities.utilities import split_quoted_delimited_str
    +from utilities.utilities import _string_to_array
    +from utilities.validate_args import columns_exist_in_table
    +from utilities.validate_args import is_var_valid
    +from utilities.validate_args import get_cols
    +from utilities.validate_args import quote_ident
    +from utilities.utilities import py_list_to_sql_string
    +
    +
    +m4_changequote(`<!', `!>')
    +
    +
    +def validate_cols2vec_args(source_table, output_table,
    +                           list_of_features, list_of_features_to_exclude, cols_to_output, **kwargs):
    +    """
    +        Function to validate input parameters
    +    """
    +    if list_of_features.strip() != '*':
    +        if not (list_of_features and list_of_features.strip()):
    +            plpy.error("Features to include is empty")
    +
    +    if list_of_features.strip() != '*':
    +        if not columns_exist_in_table(
    +                source_table, split_quoted_delimited_str(list_of_features)):
    +            plpy.error(
    +                "Invalid columns to list_of_features ({0})".format(list_of_features))
    +
    +    if cols_to_output and cols_to_output.strip() != '*':
    +        if not columns_exist_in_table(
    +                source_table, _string_to_array(cols_to_output)):
    +            plpy.error("Invalid columns to output list ({0})".
    +                       format(cols_to_output))
    +
    +
    +def cols2vec(schema_madlib, source_table, output_table, list_of_features,
    +             list_of_features_to_exclude=None, cols_to_output=None, **kwargs):
    +    """
    +    Args:
    +        @param schema_madlib:               Name of MADlib schema
    +        @param model:                       Name of table containing the tree model
    +        @param source_table:                Name of table containing prediction data
    +        @param output_table:                Name of table to output the results
    +        @param list_of_features:            Comma-separated string of column names or
    +                                            expressions to put into feature array.
    +                                            Can also be a '*' implying all columns
    +                                            are to be put into feature array.
    +        @param list_of_features_to_exclude: Comma-separated string of column names
    +                                            to exclude from the feature array
    +        @param cols_to_output:              Comma-separated string of column names
    +                                            from the source table to keep in the output table,
    +                                            in addition to the feature array.
    +
    +    Returns:
    +        None
    +
    +    """
    +
    +    with MinWarning('warning'):
    +        validate_cols2vec_args(source_table, output_table, list_of_features,
    +                               list_of_features_to_exclude, cols_to_output, **kwargs)
    +
    +        all_cols = ''
    +        feature_cols = ''
    +        if list_of_features.strip() == '*':
    +            all_cols = get_cols(source_table, schema_madlib)
    +            all_col_set = set(list(all_cols))
    +            exclude_set = set(split_quoted_delimited_str(
    +                list_of_features_to_exclude))
    +            feature_set = all_col_set - exclude_set
    +            feature_cols = py_list_to_sql_string(
    +                list(feature_set), "text", False)
    +            filtered_list_of_features = ",".join(
    +                feat for feat in list(feature_set))
    +        else:
    +            feature_list = split_quoted_delimited_str(list_of_features)
    +            feature_exclude = split_quoted_delimited_str(
    +                list_of_features_to_exclude)
    +            return_set = set(feature_list) - set(feature_exclude)
    --- End diff --
    
    The order of the features are lost in this operation.


---

[GitHub] madlib pull request #288: Jira:1239: Converts features from multiple columns...

Posted by iyerr3 <gi...@git.apache.org>.
Github user iyerr3 commented on a diff in the pull request:

    https://github.com/apache/madlib/pull/288#discussion_r200790203
  
    --- Diff: src/ports/postgres/modules/cols_vec/cols2vec.py_in ---
    @@ -0,0 +1,104 @@
    +"""
    +@file cols2vec.py_in
    +
    +@brief Utility to convert Columns to array
    +
    +"""
    +
    +import plpy
    +from utilities.control import MinWarning
    +from utilities.utilities import split_quoted_delimited_str
    +from utilities.utilities import _string_to_array
    +from utilities.utilities import _assert
    +from utilities.validate_args import columns_exist_in_table
    +from utilities.validate_args import is_var_valid
    +from utilities.validate_args import get_cols
    +from utilities.validate_args import quote_ident
    +from utilities.utilities import py_list_to_sql_string
    +
    +
    +m4_changequote(`<!', `!>')
    +
    +
    +def validate_cols2vec_args(source_table, output_table,
    +                           list_of_features, list_of_features_to_exclude, cols_to_output, **kwargs):
    +    """
    +        Function to validate input parameters
    +    """
    +    if list_of_features.strip() != '*':
    +        if not (list_of_features and list_of_features.strip()):
    +            plpy.error("Features to include is empty")
    +        _assert(
    +            columns_exist_in_table(
    +                source_table, split_quoted_delimited_str(list_of_features)),
    +            "Invalid columns to list of features {0}".format(list_of_features))
    +
    +    if cols_to_output and cols_to_output.strip() != '*':
    +        _assert(
    +            columns_exist_in_table(
    +                source_table, _string_to_array(cols_to_output)),
    +            "Invalid columns to output list {0}".format(cols_to_output))
    +
    +
    +def cols2vec(schema_madlib, source_table, output_table, list_of_features,
    +             list_of_features_to_exclude=None, cols_to_output=None, **kwargs):
    +    """
    +    Args:
    +        @param schema_madlib:               Name of MADlib schema
    +        @param model:                       Name of table containing the tree model
    +        @param source_table:                Name of table containing prediction data
    +        @param output_table:                Name of table to output the results
    +        @param list_of_features:            Comma-separated string of column names or
    +                                            expressions to put into feature array.
    +                                            Can also be a '*' implying all columns
    +                                            are to be put into feature array.
    +        @param list_of_features_to_exclude: Comma-separated string of column names
    +                                            to exclude from the feature array
    +        @param cols_to_output:              Comma-separated string of column names
    +                                            from the source table to keep in the output table,
    +                                            in addition to the feature array.
    +
    +    Returns:
    +        None
    +
    +    """
    +
    +    with MinWarning('warning'):
    +        validate_cols2vec_args(source_table, output_table, list_of_features,
    +                               list_of_features_to_exclude, cols_to_output, **kwargs)
    +
    +        all_cols = ''
    +        feature_cols = ''
    +        feature_list = ''
    +        if list_of_features.strip() == '*':
    +            all_cols = get_cols(source_table, schema_madlib)
    +            all_col_set = set(list(all_cols))
    --- End diff --
    
    The order of the columns (retained by `get_cols`) is lost here. I suggest: 
    ```
    feature_list = [col for col in all_cols if col not in exclude_set]
    ```


---

[GitHub] madlib issue #288: Jira:1239: Converts features from multiple columns into a...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit commented on the issue:

    https://github.com/apache/madlib/pull/288
  
    
    Refer to this link for build results (access rights to CI server needed): 
    https://builds.apache.org/job/madlib-pr-build/539/



---

[GitHub] madlib issue #288: Jira:1239: Converts features from multiple columns into a...

Posted by fmcquillan99 <gi...@git.apache.org>.
Github user fmcquillan99 commented on the issue:

    https://github.com/apache/madlib/pull/288
  
    Since we are writing out a summary table, may as well add more info in it.
    
    {code}
    A summary table named <out_table>_summary is also created at the same time, which has the following columns:
    
    source_table                      TEXT. Source table name.
    feature_names                   TEXT[]. Array of names of features.
    total_rows_processed        INTEGER. Total numbers of rows processed.
    total_rows_skipped            INTEGER.  Total numbers of rows skipped due to failures.
    {code}
    
    Is this do-able @hpandeycodeit ?



---

[GitHub] madlib issue #288: Jira:1239: Converts features from multiple columns into a...

Posted by fmcquillan99 <gi...@git.apache.org>.
Github user fmcquillan99 commented on the issue:

    https://github.com/apache/madlib/pull/288
  
    update my comment above to remove the rows processed and skipped.



---

[GitHub] madlib pull request #288: Jira:1239: Converts features from multiple columns...

Posted by fmcquillan99 <gi...@git.apache.org>.
Github user fmcquillan99 commented on a diff in the pull request:

    https://github.com/apache/madlib/pull/288#discussion_r200510366
  
    --- Diff: src/ports/postgres/modules/cols_vec/cols2vec.py_in ---
    @@ -0,0 +1,110 @@
    +"""
    +@file cols2vec.py_in
    +
    +@brief Utility to convert Columns to array
    +
    +"""
    +
    +import plpy
    +from utilities.control import MinWarning
    +from utilities.utilities import split_quoted_delimited_str
    +from utilities.utilities import _string_to_array
    +from utilities.validate_args import columns_exist_in_table
    +from utilities.validate_args import is_var_valid
    +from utilities.validate_args import get_cols
    +from utilities.validate_args import quote_ident
    +from utilities.utilities import py_list_to_sql_string
    +
    +
    +m4_changequote(`<!', `!>')
    +
    +
    +def validate_cols2vec_args(source_table, output_table,
    +                           list_of_features, list_of_features_to_exclude, cols_to_output, **kwargs):
    +    """
    +        Function to validate input parameters
    +    """
    +    if list_of_features.strip() != '*':
    +        if not (list_of_features and list_of_features.strip()):
    +            plpy.error("Features to include is empty")
    +
    +    if list_of_features.strip() != '*':
    +        if not columns_exist_in_table(
    +                source_table, split_quoted_delimited_str(list_of_features)):
    +            plpy.error(
    +                "Invalid columns to list_of_features ({0})".format(list_of_features))
    +
    +    if cols_to_output and cols_to_output.strip() != '*':
    +        if not columns_exist_in_table(
    +                source_table, _string_to_array(cols_to_output)):
    +            plpy.error("Invalid columns to output list ({0})".
    +                       format(cols_to_output))
    +
    +
    +def cols2vec(schema_madlib, source_table, output_table, list_of_features,
    +             list_of_features_to_exclude=None, cols_to_output=None, **kwargs):
    +    """
    +    Args:
    +        @param schema_madlib:               Name of MADlib schema
    +        @param model:                       Name of table containing the tree model
    +        @param source_table:                Name of table containing prediction data
    +        @param output_table:                Name of table to output the results
    +        @param list_of_features:            Comma-separated string of column names or
    +                                            expressions to put into feature array.
    +                                            Can also be a '*' implying all columns
    +                                            are to be put into feature array.
    +        @param list_of_features_to_exclude: Comma-separated string of column names
    +                                            to exclude from the feature array
    +        @param cols_to_output:              Comma-separated string of column names
    +                                            from the source table to keep in the output table,
    +                                            in addition to the feature array.
    +
    +    Returns:
    +        None
    +
    +    """
    +
    +    with MinWarning('warning'):
    +        validate_cols2vec_args(source_table, output_table, list_of_features,
    +                               list_of_features_to_exclude, cols_to_output, **kwargs)
    +
    +        all_cols = ''
    +        feature_cols = ''
    +        if list_of_features.strip() == '*':
    +            all_cols = get_cols(source_table, schema_madlib)
    +            all_col_set = set(list(all_cols))
    +            exclude_set = set(split_quoted_delimited_str(
    +                list_of_features_to_exclude))
    +            feature_set = all_col_set - exclude_set
    +            feature_cols = py_list_to_sql_string(
    +                list(feature_set), "text", False)
    +            filtered_list_of_features = ",".join(
    +                feat for feat in list(feature_set))
    +        else:
    +            feature_list = split_quoted_delimited_str(list_of_features)
    +            feature_exclude = split_quoted_delimited_str(
    +                list_of_features_to_exclude)
    +            return_set = set(feature_list) - set(feature_exclude)
    +            feature_cols = py_list_to_sql_string(
    +                list(return_set), "text", False)
    +            filtered_list_of_features = ",".join(
    +                feat for feat in feature_list if feat in return_set)
    +
    +        output_cols = ''
    +        if cols_to_output:
    +            output_cols_list = [', '.join(get_cols(source_table, schema_madlib)) if col == '*' else col
    +                                for col in split_quoted_delimited_str(cols_to_output)]
    +            output_cols = ', '.join(output_cols_list) + ","
    +
    +        plpy.execute("""
    +    		CREATE TABLE {output_table} AS
    +    		select {output_cols}
    +            array[{filtered_list_of_features}] as feature_vector
    +    		from {source_table}
    +    		""".format(**locals()))
    +
    +        plpy.execute("""
    --- End diff --
    
    If there are 1000+ columns which you want to keep track of, then saving the array or col names in a summary tables might be convenient.   It is not ideal but should not be onerous to the user and they can ignore the summary table if they don't care about it.


---

[GitHub] madlib pull request #288: Jira:1239: Converts features from multiple columns...

Posted by njayaram2 <gi...@git.apache.org>.
Github user njayaram2 commented on a diff in the pull request:

    https://github.com/apache/madlib/pull/288#discussion_r200444598
  
    --- Diff: src/ports/postgres/modules/cols_vec/cols2vec.py_in ---
    @@ -0,0 +1,110 @@
    +"""
    +@file cols2vec.py_in
    +
    +@brief Utility to convert Columns to array
    +
    +"""
    +
    +import plpy
    +from utilities.control import MinWarning
    +from utilities.utilities import split_quoted_delimited_str
    +from utilities.utilities import _string_to_array
    +from utilities.validate_args import columns_exist_in_table
    +from utilities.validate_args import is_var_valid
    +from utilities.validate_args import get_cols
    +from utilities.validate_args import quote_ident
    +from utilities.utilities import py_list_to_sql_string
    +
    +
    +m4_changequote(`<!', `!>')
    +
    +
    +def validate_cols2vec_args(source_table, output_table,
    +                           list_of_features, list_of_features_to_exclude, cols_to_output, **kwargs):
    +    """
    +        Function to validate input parameters
    +    """
    +    if list_of_features.strip() != '*':
    +        if not (list_of_features and list_of_features.strip()):
    +            plpy.error("Features to include is empty")
    +
    +    if list_of_features.strip() != '*':
    +        if not columns_exist_in_table(
    +                source_table, split_quoted_delimited_str(list_of_features)):
    +            plpy.error(
    +                "Invalid columns to list_of_features ({0})".format(list_of_features))
    +
    +    if cols_to_output and cols_to_output.strip() != '*':
    +        if not columns_exist_in_table(
    +                source_table, _string_to_array(cols_to_output)):
    +            plpy.error("Invalid columns to output list ({0})".
    +                       format(cols_to_output))
    +
    +
    +def cols2vec(schema_madlib, source_table, output_table, list_of_features,
    +             list_of_features_to_exclude=None, cols_to_output=None, **kwargs):
    +    """
    +    Args:
    +        @param schema_madlib:               Name of MADlib schema
    +        @param model:                       Name of table containing the tree model
    +        @param source_table:                Name of table containing prediction data
    +        @param output_table:                Name of table to output the results
    +        @param list_of_features:            Comma-separated string of column names or
    +                                            expressions to put into feature array.
    +                                            Can also be a '*' implying all columns
    +                                            are to be put into feature array.
    +        @param list_of_features_to_exclude: Comma-separated string of column names
    +                                            to exclude from the feature array
    +        @param cols_to_output:              Comma-separated string of column names
    +                                            from the source table to keep in the output table,
    +                                            in addition to the feature array.
    +
    +    Returns:
    +        None
    +
    +    """
    +
    +    with MinWarning('warning'):
    +        validate_cols2vec_args(source_table, output_table, list_of_features,
    +                               list_of_features_to_exclude, cols_to_output, **kwargs)
    +
    +        all_cols = ''
    +        feature_cols = ''
    +        if list_of_features.strip() == '*':
    +            all_cols = get_cols(source_table, schema_madlib)
    +            all_col_set = set(list(all_cols))
    +            exclude_set = set(split_quoted_delimited_str(
    +                list_of_features_to_exclude))
    +            feature_set = all_col_set - exclude_set
    +            feature_cols = py_list_to_sql_string(
    +                list(feature_set), "text", False)
    +            filtered_list_of_features = ",".join(
    +                feat for feat in list(feature_set))
    +        else:
    +            feature_list = split_quoted_delimited_str(list_of_features)
    +            feature_exclude = split_quoted_delimited_str(
    +                list_of_features_to_exclude)
    +            return_set = set(feature_list) - set(feature_exclude)
    +            feature_cols = py_list_to_sql_string(
    +                list(return_set), "text", False)
    +            filtered_list_of_features = ",".join(
    +                feat for feat in feature_list if feat in return_set)
    +
    +        output_cols = ''
    +        if cols_to_output:
    +            output_cols_list = [', '.join(get_cols(source_table, schema_madlib)) if col == '*' else col
    +                                for col in split_quoted_delimited_str(cols_to_output)]
    +            output_cols = ', '.join(output_cols_list) + ","
    +
    +        plpy.execute("""
    +    		CREATE TABLE {output_table} AS
    +    		select {output_cols}
    +            array[{filtered_list_of_features}] as feature_vector
    +    		from {source_table}
    +    		""".format(**locals()))
    +
    +        plpy.execute("""
    --- End diff --
    
    @iyerr3  The vec2cols story (https://issues.apache.org/jira/browse/MADLIB-1240) might consume this summary table if provided as the `dictionary` param in that module. 


---

[GitHub] madlib pull request #288: Jira:1239: Converts features from multiple columns...

Posted by hpandeycodeit <gi...@git.apache.org>.
Github user hpandeycodeit commented on a diff in the pull request:

    https://github.com/apache/madlib/pull/288#discussion_r200788091
  
    --- Diff: src/ports/postgres/modules/cols_vec/cols2vec.py_in ---
    @@ -0,0 +1,110 @@
    +"""
    +@file cols2vec.py_in
    +
    +@brief Utility to convert Columns to array
    +
    +"""
    +
    +import plpy
    +from utilities.control import MinWarning
    +from utilities.utilities import split_quoted_delimited_str
    +from utilities.utilities import _string_to_array
    +from utilities.validate_args import columns_exist_in_table
    +from utilities.validate_args import is_var_valid
    +from utilities.validate_args import get_cols
    +from utilities.validate_args import quote_ident
    +from utilities.utilities import py_list_to_sql_string
    +
    +
    +m4_changequote(`<!', `!>')
    +
    +
    +def validate_cols2vec_args(source_table, output_table,
    +                           list_of_features, list_of_features_to_exclude, cols_to_output, **kwargs):
    +    """
    +        Function to validate input parameters
    +    """
    +    if list_of_features.strip() != '*':
    +        if not (list_of_features and list_of_features.strip()):
    +            plpy.error("Features to include is empty")
    +
    +    if list_of_features.strip() != '*':
    +        if not columns_exist_in_table(
    +                source_table, split_quoted_delimited_str(list_of_features)):
    +            plpy.error(
    +                "Invalid columns to list_of_features ({0})".format(list_of_features))
    +
    +    if cols_to_output and cols_to_output.strip() != '*':
    +        if not columns_exist_in_table(
    +                source_table, _string_to_array(cols_to_output)):
    +            plpy.error("Invalid columns to output list ({0})".
    +                       format(cols_to_output))
    +
    +
    +def cols2vec(schema_madlib, source_table, output_table, list_of_features,
    +             list_of_features_to_exclude=None, cols_to_output=None, **kwargs):
    +    """
    +    Args:
    +        @param schema_madlib:               Name of MADlib schema
    +        @param model:                       Name of table containing the tree model
    +        @param source_table:                Name of table containing prediction data
    +        @param output_table:                Name of table to output the results
    +        @param list_of_features:            Comma-separated string of column names or
    +                                            expressions to put into feature array.
    +                                            Can also be a '*' implying all columns
    +                                            are to be put into feature array.
    +        @param list_of_features_to_exclude: Comma-separated string of column names
    +                                            to exclude from the feature array
    +        @param cols_to_output:              Comma-separated string of column names
    +                                            from the source table to keep in the output table,
    +                                            in addition to the feature array.
    +
    +    Returns:
    +        None
    +
    +    """
    +
    +    with MinWarning('warning'):
    +        validate_cols2vec_args(source_table, output_table, list_of_features,
    +                               list_of_features_to_exclude, cols_to_output, **kwargs)
    +
    +        all_cols = ''
    +        feature_cols = ''
    +        if list_of_features.strip() == '*':
    +            all_cols = get_cols(source_table, schema_madlib)
    +            all_col_set = set(list(all_cols))
    +            exclude_set = set(split_quoted_delimited_str(
    +                list_of_features_to_exclude))
    +            feature_set = all_col_set - exclude_set
    +            feature_cols = py_list_to_sql_string(
    +                list(feature_set), "text", False)
    +            filtered_list_of_features = ",".join(
    +                feat for feat in list(feature_set))
    +        else:
    +            feature_list = split_quoted_delimited_str(list_of_features)
    +            feature_exclude = split_quoted_delimited_str(
    +                list_of_features_to_exclude)
    +            return_set = set(feature_list) - set(feature_exclude)
    --- End diff --
    
    Updated the above code as well along Also the order of features will remain same. 


---

[GitHub] madlib pull request #288: Jira:1239: Converts features from multiple columns...

Posted by iyerr3 <gi...@git.apache.org>.
Github user iyerr3 commented on a diff in the pull request:

    https://github.com/apache/madlib/pull/288#discussion_r199981733
  
    --- Diff: src/config/Modules.yml ---
    @@ -50,3 +50,4 @@ modules:
         - name: validation
           depends: ['array_ops', 'regress']
         - name: stemmer
    +    - name: cols_vec
    --- End diff --
    
    I'm not convinced that we need a new module for this functionality. IMO this is better suited for the `utilities` module as a separate file.  


---

[GitHub] madlib issue #288: Jira:1239: Converts features from multiple columns into a...

Posted by hpandeycodeit <gi...@git.apache.org>.
Github user hpandeycodeit commented on the issue:

    https://github.com/apache/madlib/pull/288
  
    @fmcquillan99 , 
    
    What is total_rows_processed and total_rows_skipped ? Can you provide more details on these? 


---

[GitHub] madlib pull request #288: Jira:1239: Converts features from multiple columns...

Posted by hpandeycodeit <gi...@git.apache.org>.
Github user hpandeycodeit commented on a diff in the pull request:

    https://github.com/apache/madlib/pull/288#discussion_r200787816
  
    --- Diff: src/ports/postgres/modules/cols_vec/cols2vec.py_in ---
    @@ -0,0 +1,110 @@
    +"""
    +@file cols2vec.py_in
    +
    +@brief Utility to convert Columns to array
    +
    +"""
    +
    +import plpy
    +from utilities.control import MinWarning
    +from utilities.utilities import split_quoted_delimited_str
    +from utilities.utilities import _string_to_array
    +from utilities.validate_args import columns_exist_in_table
    +from utilities.validate_args import is_var_valid
    +from utilities.validate_args import get_cols
    +from utilities.validate_args import quote_ident
    +from utilities.utilities import py_list_to_sql_string
    +
    +
    +m4_changequote(`<!', `!>')
    +
    +
    +def validate_cols2vec_args(source_table, output_table,
    +                           list_of_features, list_of_features_to_exclude, cols_to_output, **kwargs):
    +    """
    +        Function to validate input parameters
    +    """
    +    if list_of_features.strip() != '*':
    --- End diff --
    
    This is done. 


---

[GitHub] madlib pull request #288: Jira:1239: Converts features from multiple columns...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/madlib/pull/288


---