Posted to issues@madlib.apache.org by "Frank McQuillan (JIRA)" <ji...@apache.org> on 2018/05/23 20:18:00 UTC

[jira] [Updated] (MADLIB-1239) Columns to Vector

     [ https://issues.apache.org/jira/browse/MADLIB-1239?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Frank McQuillan updated MADLIB-1239:
------------------------------------
    Description: 
Columns to Vector

Converts features from multiple columns of an input table into a feature array in a single column.  Also outputs the names of the features into an array in a single column. This process can be reversed using the function vec2cols.

{code}
cols2vec(
    source_table,
    out_table,
    list_of_features,
    list_of_features_to_exclude,
    cols_to_output
    )

source_table
TEXT. Name of the table containing the source data.

out_table
TEXT. Name of the generated table containing the output. If a table with the same name already exists, an error will be returned. 

list_of_features
TEXT. Comma-separated string of column names or expressions to put into the feature array. Can also be '*', meaning that all columns are put into the array (except those listed in the next argument, list_of_features_to_exclude). All features should have the same type, since PostgreSQL arrays only support elements of a single type. If multiple numeric types are present in the list of features, they will be cast to DOUBLE PRECISION in the feature array.

Array columns can also be included in the list, and the array will be expanded to treat each element of the array as a separate feature.

list_of_features_to_exclude (optional)
TEXT, default NULL. Comma-separated string of column names to exclude from the feature array.  Use only when list_of_features is '*'.

cols_to_output (optional)
TEXT, default NULL. Comma-separated string of column names from the source table to keep in the output table, in addition to the feature array.  To keep all columns from the source table, use '*'.


Output

The output table produced by the cols2vec function contains the following columns:

<...>
Columns from source table, depending on which ones are kept (if any).

feature_vector
Array of features.  Array type will depend on feature type in the source table.

feature_names
TEXT[] Array of names of features.
{code}
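
To make the described output concrete, here is a minimal sketch in plain PostgreSQL of the table that a cols2vec call would be expected to produce according to the description above. The table weather_src, its columns, and its values are made up for illustration. Because cols2vec is only proposed in this issue, the call itself appears as a comment (schema qualification omitted), and the CREATE TABLE AS statement is a hand-written equivalent, not the MADlib implementation.

```sql
-- Hypothetical source table; names and values are made up for illustration.
CREATE TABLE weather_src (
    id          INTEGER,
    temperature DOUBLE PRECISION,
    humidity    DOUBLE PRECISION,
    label       TEXT
);

INSERT INTO weather_src VALUES
    (1, 85.0, 85.0, 'no'),
    (2, 72.5, 90.0, 'yes');

-- Proposed call, following the signature in this issue:
--   SELECT cols2vec('weather_src', 'weather_vec',
--                   'temperature,humidity', NULL, 'id');
--
-- Hand-written equivalent of the output table described above.
CREATE TABLE weather_vec AS
SELECT
    id,                                                        -- kept via cols_to_output
    ARRAY[temperature, humidity]             AS feature_vector, -- DOUBLE PRECISION[]
    ARRAY['temperature', 'humidity']::TEXT[] AS feature_names   -- same on every row
FROM weather_src;

SELECT * FROM weather_vec ORDER BY id;
```

Each output row carries the kept column (id), one array holding that row's feature values, and one array holding the feature names in the same order.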

Open questions for cols2vec:

1) Is it OK to cast to DOUBLE PRECISION if there are mixed numeric types? Or should we enforce that all feature columns have exactly the same type, since the elements of a PostgreSQL array must all be the same type?
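
For context on this question, the sketch below shows how plain PostgreSQL already treats mixed numeric types in an ARRAY constructor, and what an explicit cast to DOUBLE PRECISION[] looks like. It only illustrates standard PostgreSQL behavior and says nothing about how cols2vec will be implemented.

```sql
-- Mixed numeric element types are resolved to one common type.
-- INTEGER and DOUBLE PRECISION resolve to double precision[].
SELECT pg_typeof(ARRAY[1::INTEGER, 2.5::DOUBLE PRECISION]) AS resolved_type;

-- Explicit cast of a mixed numeric array, as the description proposes
-- for feature columns of different numeric types.
SELECT ARRAY[1::SMALLINT, 2::BIGINT, 3.5::REAL]::DOUBLE PRECISION[] AS feature_vector;
```

Mixing a numeric element with a TEXT element in the same ARRAY constructor is rejected with an error, which is why the same-type requirement matters for non-numeric features.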


Aside

The PDL Tools function
http://pivotalsoftware.github.io/PDLTools/group__grp__array__utilities.html#cols2vec_example
is similar, but the proposed MADlib function has more options. The equivalent of the PDL Tools call in MADlib would be:

{code}
cols2vec(
    table_name,
    output_table,
    '*',
    exclude_columns
    )
{code}
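
As a rough illustration of the '*' form with exclusions, the following plain PostgreSQL sketch lists the columns that would remain as features once the excluded ones are dropped. The table patients_src and its columns are hypothetical, and looking the columns up in information_schema is only one possible approach, not necessarily what MADlib will do.

```sql
-- Hypothetical table, made up for illustration.
CREATE TABLE patients_src (
    id     INTEGER,
    age    DOUBLE PRECISION,
    weight DOUBLE PRECISION,
    notes  TEXT
);

-- Columns that '*' would select after excluding 'id' and 'notes',
-- returned in the table's column order.
SELECT array_agg(column_name::TEXT ORDER BY ordinal_position) AS feature_names
FROM information_schema.columns
WHERE table_schema = current_schema()
  AND table_name   = 'patients_src'
  AND column_name NOT IN ('id', 'notes');
```

For this table the query should leave age and weight, which are the names that would end up in feature_names.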

  was:
Columns to Vector

Converts features from multiple columns of an input table into a feature array in a single column.  Also outputs the names of the features into an array in a single column. This process can be reversed using the function vec2cols.

{code}
cols2vec(
    source_table,
    out_table,
    list_of_features,
    list_of_features_to_exclude,
    cols_to_output
    )

source_table
TEXT. Name of the table containing the source data.

out_table
TEXT. Name of the generated table containing the output. If a table with the same name already exists, an error will be returned. 

list_of_features
TEXT. Comma-separated string of column names or expressions to put into array. Can also be a '*' implying all columns are to be put into array (except for the ones included in the next argument that lists exclusions). The types of the features should all be the same since PostgreSQL arrays only support elements of the same type.  If multiple numeric types are present in the source table, they will be cast to DOUBLE PRECISION.

Array columns can also be included in the list, and the array will be expanded to treat each element of the array as a separate feature.

list_of_features_to_exclude (optional)
TEXT, default NULL. Comma-separated string of column names to exclude from the feature array.  Use only when list_of_features is '*'.

cols_to_output (optional)
TEXT, default NULL. Comma-separated string of column names from the source table to keep in the output table, in addition to the feature array.  To keep all columns from the source table, use '*'.


Output

The output table produced by the cols2vec function contains the following columns:

<...>
Columns from source table, depending on which ones are kept (if any).

feature_vector
Array of features.  Array type will depend on feature type in the source table.

feature_names
TEXT[] Array of names of features.
{code}

Open questions for cols2vec:

1) OK to cast to double if there are mixed numeric types?  Or should we enforce that all feature columns be the exact same type? (Since elements of PostgreSQL arrays need to be the same type.)


Aside

The function
http://pivotalsoftware.github.io/PDLTools/group__grp__array__utilities.html#cols2vec_example
is similar but the proposed MADlib one has more options.  To do the equivalent of the PDL Tools one in MADlib, you would do:

{code}
cols2vec(
    table_name,
    output_table,
    '*',
    exclude_columns
    )
{code}


> Columns to Vector
> -----------------
>
>                 Key: MADLIB-1239
>                 URL: https://issues.apache.org/jira/browse/MADLIB-1239
>             Project: Apache MADlib
>          Issue Type: New Feature
>          Components: Module: Utilities
>            Reporter: Frank McQuillan
>            Priority: Major
>             Fix For: v1.15
>



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)