Posted to issues@madlib.apache.org by "Rahul Iyer (JIRA)" <ji...@apache.org> on 2016/08/09 16:52:20 UTC

[jira] [Created] (MADLIB-1013) Add array output to create_indicator_variables

Rahul Iyer created MADLIB-1013:
----------------------------------

             Summary: Add array output to create_indicator_variables
                 Key: MADLIB-1013
                 URL: https://issues.apache.org/jira/browse/MADLIB-1013
             Project: Apache MADlib
          Issue Type: Improvement
          Components: Module: Utilities
            Reporter: Rahul Iyer


Feature request from Satoshi Nagayasu <sn...@uptime.jp>
---------------------------------------------------------------------------------------
I'm trying to use create_indicator_variables() to encode categorical variables.

https://madlib.incubator.apache.org/docs/latest/group__grp__data__prep.html

I found that PostgreSQL limits the number of entries in a SELECT list
(called the target list in PostgreSQL) to 1664.

You may see this error when you have more than 1664 categories in your variable.

spiexceptions.ProgramLimitExceeded: target lists can have at most 1664 entries

Now I'm considering using PostgreSQL arrays to hold the indicators instead of
allocating a single column per category.

If create_indicator_variables() supported arrays as its output, it would
let us handle categorical variables that have more than 1664 categories.
I would also like to use the sparse vector type to compress the arrays.

https://madlib.incubator.apache.org/docs/latest/group__grp__svec.html
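To make the request concrete, here is a minimal Python sketch of the idea, not MADlib code. It encodes each categorical value as a single 0/1 array (one array per row instead of one column per category), then compresses that array with a simple run-length encoding as a stand-in for a sparse vector. The function names, the (run_length, value) pair format, and the sample data are all assumptions made up for illustration.

```python
from itertools import groupby


def indicator_array(value, categories):
    """Return a 0/1 list with a 1 at the position of `value` in `categories`."""
    return [1 if c == value else 0 for c in categories]


def run_length_encode(dense):
    """Compress a dense list into (run_length, value) pairs."""
    return [(len(list(run)), val) for val, run in groupby(dense)]


# Hypothetical data: 2000 distinct categories, which is more than the
# 1664-entry target list limit would allow as separate columns.
categories = ["cat_%04d" % i for i in range(2000)]
rows = ["cat_0003", "cat_1999"]

for value in rows:
    dense = indicator_array(value, categories)   # one array per row
    sparse = run_length_encode(dense)            # compact form of the same array
    print(value, len(dense), sparse)
```

Because each row has exactly one indicator set, the run-length form needs at most three pairs per row no matter how many categories exist, which is why a sparse representation is attractive here.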


