You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@madlib.apache.org by "Frank McQuillan (JIRA)" <ji...@apache.org> on 2017/02/09 23:35:42 UTC

[jira] [Updated] (MADLIB-1066) Pivoting - support array and svec output

     [ https://issues.apache.org/jira/browse/MADLIB-1066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Frank McQuillan updated MADLIB-1066:
------------------------------------
    Summary: Pivoting - support array and svec output  (was: CLONE - Pivoting - Phase 3)

> Pivoting - support array and svec output
> ----------------------------------------
>
>                 Key: MADLIB-1066
>                 URL: https://issues.apache.org/jira/browse/MADLIB-1066
>             Project: Apache MADlib
>          Issue Type: Improvement
>          Components: Module: Utilities
>            Reporter: Frank McQuillan
>            Priority: Minor
>             Fix For: v2.0
>
>
> Background
> Follow on to these JIRAs
> https://issues.apache.org/jira/browse/MADLIB-908
> https://issues.apache.org/jira/browse/MADLIB-1004
> this capability is to carry over some good ideas from
> https://issues.apache.org/jira/browse/MADLIB-1038
> Story
> Support array output format to allow > 1600 output columns (or PostgreSQL limit).  i.e., many MADlib algos take array input so pivot should support array output.  Base this on how it is done in encoding categorical variables http://madlib.incubator.apache.org/docs/latest/group__grp__encode__categorical.html
> Add 'output_type' to interface:
> {code}
> pivot(
>     source_table,
>     output_table,
>     index,
>     pivot_cols,
>     pivot_values,
>     aggregate_func,
>     fill_value,
>     keep_null,
>     output_col_dictionary,
>     output_type                          -- New
>     )
> {code}
> where
> {code}
> output_type (optional)
> VARCHAR. default: 'column'. This parameter controls the output format.  If 'column', a column is created for each output variable. PostgreSQL limits the number of columns in a table. If the total number of columns exceeds the limit, then make this parameter either 'array' to combine the indicator columns into an array or 'svec' to cast the array output to 'madlib.svec' type.
> Since the array output for any single tuple would be sparse, the 'svec' output would be most efficient for storage. The 'array' output is useful if the array is used for post-processing, including concatenating with other non-categorical features.
> A dictionary will be created when 'output_type' is 'array' or 'svec' to define an index into the array. The dictionary table will be given the name of the 'output_table' appended by '_dictionary'.
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)