Posted to issues@madlib.apache.org by "ASF GitHub Bot (JIRA)" <ji...@apache.org> on 2017/03/15 17:40:48 UTC

[jira] [Commented] (MADLIB-1066) Pivoting - support array and svec output

    [ https://issues.apache.org/jira/browse/MADLIB-1066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15926625#comment-15926625 ] 

ASF GitHub Bot commented on MADLIB-1066:
----------------------------------------

GitHub user iyerr3 opened a pull request:

    https://github.com/apache/incubator-madlib/pull/108

    Pivot: Add support for array output

    JIRA: MADLIB-1066
    
    When total pivoted columns exceed 1600, an array output becomes
    essential. This commit adds support to get each pivoted set of columns
    (all columns related to a particular value-aggregate combination) as an
    array. There is also support for getting the output as madlib.svec.
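
As a rough illustration of the array output described above (not the code in this pull request), the following Python sketch packs each index's pivoted values into one array and builds a dictionary that maps array positions back to pivot values. The function name `pivot_to_arrays`, the use of sum as the aggregate, and the 0-based positions are assumptions made for this example only.

```python
from collections import defaultdict


def pivot_to_arrays(rows, fill_value=None):
    """Hypothetical helper: pivot (index, pivot_value, value) rows into arrays.

    Returns (dictionary, arrays), where dictionary maps array position to
    pivot value and arrays maps each index to its list of aggregated values.
    The aggregate is assumed to be a sum for this sketch.
    """
    rows = list(rows)
    # Fix the array layout from the sorted distinct pivot values.
    pivot_values = sorted({pivot for _, pivot, _ in rows})

    # Aggregate (sum) the values per (index, pivot value) pair.
    sums = defaultdict(dict)
    for index, pivot, value in rows:
        sums[index][pivot] = sums[index].get(pivot, 0) + value

    # One array per index; missing combinations get fill_value.
    arrays = {
        index: [cells.get(pivot, fill_value) for pivot in pivot_values]
        for index, cells in sums.items()
    }
    # Position -> pivot value, playing the role of the dictionary table.
    dictionary = dict(enumerate(pivot_values))
    return dictionary, arrays


rows = [(1, "a", 10), (1, "b", 5), (1, "a", 2), (2, "b", 7)]
dictionary, arrays = pivot_to_arrays(rows)
print(dictionary)  # {0: 'a', 1: 'b'}
print(arrays)      # {1: [12, 5], 2: [None, 7]}
```

However many distinct pivot values there are, the output keeps a single array column per index, which is how an array output sidesteps the column limit.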

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/iyerr3/incubator-madlib feature/pivot_array_support

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/incubator-madlib/pull/108.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #108
    
----
commit fe579d1300f0c30eeb72e7d8b411af9cdffe2c59
Author: Rahul Iyer <ri...@apache.org>
Date:   2017-03-11T00:45:03Z

    Pivot: Add support for array output
    
    JIRA: MADLIB-1066
    
    When total pivoted columns exceed 1600, an array output becomes
    essential. This commit adds support to get each pivoted set of columns
    (all columns related to a particular value-aggregate combination) as an
    array. There is also support for getting the output as madlib.svec.

----


> Pivoting - support array and svec output
> ----------------------------------------
>
>                 Key: MADLIB-1066
>                 URL: https://issues.apache.org/jira/browse/MADLIB-1066
>             Project: Apache MADlib
>          Issue Type: Improvement
>          Components: Module: Utilities
>            Reporter: Frank McQuillan
>            Priority: Minor
>             Fix For: v1.11
>
>
> Background
> This is a follow-on to these JIRAs:
> https://issues.apache.org/jira/browse/MADLIB-908
> https://issues.apache.org/jira/browse/MADLIB-1004
> This capability carries over some good ideas from:
> https://issues.apache.org/jira/browse/MADLIB-1038
> Story
> Support an array output format to allow more than 1600 output columns (or whatever the PostgreSQL limit is). Many MADlib algorithms take array input, so pivot should support array output. Base this on how it is done for encoding categorical variables: http://madlib.incubator.apache.org/docs/latest/group__grp__encode__categorical.html
> Add 'output_type' to interface:
> {code}
> pivot(
>     source_table,
>     output_table,
>     index,
>     pivot_cols,
>     pivot_values,
>     aggregate_func,
>     fill_value,
>     keep_null,
>     output_col_dictionary,
>     output_type                          -- New
>     )
> {code}
> where
> {code}
> output_type (optional)
> VARCHAR. Default: 'column'. This parameter controls the output format. If 'column', a column is created for each output variable. PostgreSQL limits the number of columns in a table. If the total number of columns exceeds the limit, set this parameter to either 'array', to combine the output columns into an array, or 'svec', to cast the array output to the 'madlib.svec' type.
> Since the array output for any single tuple would be sparse, the 'svec' output would be the most efficient for storage. The 'array' output is useful if the array is used for post-processing, including concatenating with other non-categorical features.
> A dictionary will be created when 'output_type' is 'array' or 'svec' to define an index into the array. The dictionary table will be given the name of the 'output_table' with '_dictionary' appended.
> {code}
> See code in
> http://madlib.incubator.apache.org/docs/latest/group__grp__encode__categorical.html
> The function needs to support NULL (equivalent to the default 'column'). Also, 'a', 'Array', and 'arr' should be interpreted as 'array'. The same idea applies to 'column' and 'svec'.
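
The matching rule requested at the end of the issue could be sketched as follows. This is a minimal illustration, not the MADlib implementation: the helper name `normalize_output_type` is hypothetical, and rejecting empty strings and unknown values with a `ValueError` is an assumption the issue does not specify.

```python
VALID_OUTPUT_TYPES = ("column", "array", "svec")


def normalize_output_type(output_type):
    """Hypothetical helper: map a user-supplied output_type to a canonical name.

    None (SQL NULL) falls back to the default 'column'. Any case-insensitive
    prefix of a valid name is accepted, e.g. 'a', 'arr', or 'Array' for 'array'.
    """
    if output_type is None:
        return "column"
    key = output_type.strip().lower()
    if not key:
        # Assumption: an empty string is an error rather than a default.
        raise ValueError("output_type must not be empty")
    for candidate in VALID_OUTPUT_TYPES:
        if candidate.startswith(key):
            return candidate
    raise ValueError("unknown output_type: %r" % output_type)


for value in (None, "a", "Array", "arr", "col", "SV"):
    print(repr(value), "->", normalize_output_type(value))
```

Because the three valid names start with different letters, a prefix can never match more than one of them, so no tie-breaking is needed.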


