You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@madlib.apache.org by "Frank McQuillan (JIRA)" <ji...@apache.org> on 2016/08/17 22:20:21 UTC
[jira] [Comment Edited] (MADLIB-908) Pivoting - Phase 1 (basic pivot)

    [ https://issues.apache.org/jira/browse/MADLIB-908?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15425500#comment-15425500 ] 

Frank McQuillan edited comment on MADLIB-908 at 8/17/16 10:19 PM:
------------------------------------------------------------------

On this data set

{code}
id | id2 | piv | piv2 | val | val2 
----+-----+-----+------+-----+------
  0 |   0 |  10 |    0 |   1 |   11
  0 |   1 |  10 |  100 |   2 |   12
  0 |   1 |  20 |  100 |   3 |   13
  1 |   0 |  10 |    0 |     |    0
  1 |   2 |  20 |  100 |   4 |   14
  1 |   3 |  10 |  200 |   7 |   17
  1 |   3 |  30 |  200 |   5 |   15
  1 |   3 |  30 |  200 |   6 |   16
  1 |   4 |     |  300 |   9 |   19
    |   0 |  10 |  300 |   8 |   18
(10 rows)

DROP AGGREGATE IF EXISTS array_accum (anyelement);
CREATE AGGREGATE array_accum (anyelement)
(
    sfunc = array_append,
    stype = anyarray,
    initcond = '{}'
);
DROP TABLE IF EXISTS pivout;
SELECT madlib.pivot('pivset_ext', 'pivout', 'id', 'piv', 'val', 'array_accum');
SELECT * FROM pivout ORDER BY id;
{code}
produces
{code}
id |    val_array_accum_piv_10    |    val_array_accum_piv_20    |  val_array_accum_piv_30
----+------------------------------+------------------------------+---------------------------
  0 | {1,2,NULL}                   | {NULL,NULL,3}                | {NULL,NULL,NULL}
  1 | {NULL,NULL,NULL,7,NULL,NULL} | {4,NULL,NULL,NULL,NULL,NULL} | {NULL,5,6,NULL,NULL,NULL}
    | {8}                          | {NULL}                       | {NULL}
{code}
There seem to be additional NULL values introduced.  Kindly have a look.




was (Author: fmcquillan):
On this data set

id | id2 | piv | piv2 | val | val2 
----+-----+-----+------+-----+------
  0 |   0 |  10 |    0 |   1 |   11
  0 |   1 |  10 |  100 |   2 |   12
  0 |   1 |  20 |  100 |   3 |   13
  1 |   0 |  10 |    0 |     |    0
  1 |   2 |  20 |  100 |   4 |   14
  1 |   3 |  10 |  200 |   7 |   17
  1 |   3 |  30 |  200 |   5 |   15
  1 |   3 |  30 |  200 |   6 |   16
  1 |   4 |     |  300 |   9 |   19
    |   0 |  10 |  300 |   8 |   18
(10 rows)

DROP AGGREGATE IF EXISTS array_accum (anyelement);
CREATE AGGREGATE array_accum (anyelement)
(
    sfunc = array_append,
    stype = anyarray,
    initcond = '{}'
);
DROP TABLE IF EXISTS pivout;
SELECT madlib.pivot('pivset_ext', 'pivout', 'id', 'piv', 'val', 'array_accum');
SELECT * FROM pivout ORDER BY id;

produces

id |    val_array_accum_piv_10    |    val_array_accum_piv_20    |  val_array_accum_piv_30
----+------------------------------+------------------------------+---------------------------
  0 | {1,2,NULL}                   | {NULL,NULL,3}                | {NULL,NULL,NULL}
  1 | {NULL,NULL,NULL,7,NULL,NULL} | {4,NULL,NULL,NULL,NULL,NULL} | {NULL,5,6,NULL,NULL,NULL}
    | {8}                          | {NULL}                       | {NULL}

There seem to be additional NULL values introduced.  Kindly have a look.



> Pivoting - Phase 1 (basic pivot)
> --------------------------------
>
>                 Key: MADLIB-908
>                 URL: https://issues.apache.org/jira/browse/MADLIB-908
>             Project: Apache MADlib
>          Issue Type: New Feature
>          Components: Module: Utilities
>            Reporter: Frank McQuillan
>             Fix For: v1.9.1
>
>
> Story 
> As a data scientist, I want to perform *basic* pivot operation on my data, so that I can prepare it for input to predictive analytics algorithms.
> Details
> Basic pivot for this story means:
> * single index column
> * single pivot  column
> * single value column
> * single aggregate function = SUM
> * no fill value 
> PDL Tools [5] supports some of this functionality but its interface is confusing.   Pandas [2], PostgreSQL [4] and Aster [6] also have some version of pivoting that we can learn from. 
> In general, we are following Pandas ideas.
> References
> [1] Pivot table general information, like what is pivoting?
> https://en.wikipedia.org/wiki/Pivot_table
> [2] Pandas pivot tables and cross-tabulations
> http://pandas.pydata.org/pandas-docs/stable/reshaping.html#pivot-tables-and-cross-tabulations
> http://pandas.pydata.org/pandas-docs/stable/cookbook.html#cookbook-pivot
> http://pbpython.com/pandas-pivot-table-explained.html
> [3] GPDB pivot_sum function
> http://gpdb.docs.pivotal.io/4320/admin_guide/query.html#topic30
> [4] PostgreSQL tablefunc
> http://www.postgresql.org/docs/9.4/static/tablefunc.html
> [5] PDL tools pivoting routines
> http://pdl-tools.pa.pivotal.io/group__grp__pivot.html
> http://pdl-tools.pa.pivotal.io/group__grp__pivot01.html
> [6] Aster Pivot and Unpivot functions
> User Guide
> http://www.info.teradata.com/eDownload.cfm?itemid=122580002
> [7] PostgreSQL aggregates
> http://www.postgresql.org/docs/8.2/static/functions-aggregate.html
> [8] PostgreSQL basic statements/assignment operator, http://www.postgresql.org/docs/8.2/static/plpgsql-statements.html



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)