You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@madlib.apache.org by "Frank McQuillan (JIRA)" <ji...@apache.org> on 2017/02/09 23:25:41 UTC

[jira] [Updated] (MADLIB-1066) CLONE - Pivoting - Phase 3

     [ https://issues.apache.org/jira/browse/MADLIB-1066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Frank McQuillan updated MADLIB-1066:
------------------------------------
    Description: 
Background

Follow on to these JIRAs
https://issues.apache.org/jira/browse/MADLIB-908
https://issues.apache.org/jira/browse/MADLIB-1004

this capability is to carry over some good ideas from
https://issues.apache.org/jira/browse/MADLIB-1038

Story

Support array output format to allow > 1600 output columns (or PostgreSQL limit).  i.e., many MADlib algos take array input so pivot should support array output.  Base this on how it is done in encoding categorical variables http://madlib.incubator.apache.org/docs/latest/group__grp__encode__categorical.html

Add 'output_type' to interface:

{code}
pivot(
    source_table,
    output_table,
    index,
    pivot_cols,
    pivot_values,
    aggregate_func,
    fill_value,
    keep_null,
    output_col_dictionary,
    output type                          -- New
    )
{code}

{code}
output_type (optional)
VARCHAR. default: 'column'. This parameter controls the output format.  If 'column', a column is created for each output variable. PostgreSQL limits the number of columns in a table. If the total number of columns exceeds the limit, then make this parameter either 'array' to combine the indicator columns into an array or 'svec' to cast the array output to 'madlib.svec' type.

Since the array output for any single tuple would be sparse, the 'svec' output would be most efficient for storage. The 'array' output is useful if the array is used for post-processing, including concatenating with other non-categorical features.

A dictionary will be created when 'output_type' is 'array' or 'svec' to define an index into the array. The dictionary table will be given the name of the 'output_table' appended by '_dictionary'.
{code}


  was:
Follow on to these JIRAs
https://issues.apache.org/jira/browse/MADLIB-908
https://issues.apache.org/jira/browse/MADLIB-1004

this capability is to carry over some good ideas from
https://issues.apache.org/jira/browse/MADLIB-1038

Candidate improvements:
* output column naming options
* adding an ‘*’ option and list of features to exclude
* pivot more than 1600 column limit, i.e., most MADlib algos take array input so pivot should support array output
* Support non-STRICT functions in Greenplum and HAWQ; this was removed in 1.9.1 since it is not handled correctly.  Does work OK for Postgres.
* others??? 

References

[1] Good data set
http://pbpython.com/pandas-pivot-table-explained.html


> CLONE - Pivoting - Phase 3
> --------------------------
>
>                 Key: MADLIB-1066
>                 URL: https://issues.apache.org/jira/browse/MADLIB-1066
>             Project: Apache MADlib
>          Issue Type: Improvement
>          Components: Module: Utilities
>            Reporter: Frank McQuillan
>             Fix For: v2.0
>
>
> Background
> Follow on to these JIRAs
> https://issues.apache.org/jira/browse/MADLIB-908
> https://issues.apache.org/jira/browse/MADLIB-1004
> this capability is to carry over some good ideas from
> https://issues.apache.org/jira/browse/MADLIB-1038
> Story
> Support array output format to allow > 1600 output columns (or PostgreSQL limit).  i.e., many MADlib algos take array input so pivot should support array output.  Base this on how it is done in encoding categorical variables http://madlib.incubator.apache.org/docs/latest/group__grp__encode__categorical.html
> Add 'output_type' to interface:
> {code}
> pivot(
>     source_table,
>     output_table,
>     index,
>     pivot_cols,
>     pivot_values,
>     aggregate_func,
>     fill_value,
>     keep_null,
>     output_col_dictionary,
>     output type                          -- New
>     )
> {code}
> {code}
> output_type (optional)
> VARCHAR. default: 'column'. This parameter controls the output format.  If 'column', a column is created for each output variable. PostgreSQL limits the number of columns in a table. If the total number of columns exceeds the limit, then make this parameter either 'array' to combine the indicator columns into an array or 'svec' to cast the array output to 'madlib.svec' type.
> Since the array output for any single tuple would be sparse, the 'svec' output would be most efficient for storage. The 'array' output is useful if the array is used for post-processing, including concatenating with other non-categorical features.
> A dictionary will be created when 'output_type' is 'array' or 'svec' to define an index into the array. The dictionary table will be given the name of the 'output_table' appended by '_dictionary'.
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)