Posted to issues@madlib.apache.org by "Frank McQuillan (JIRA)" <ji...@apache.org> on 2017/02/28 23:13:46 UTC
[jira] [Updated] (MADLIB-965) RF and DT should accept array input for feature vector

     [ https://issues.apache.org/jira/browse/MADLIB-965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Frank McQuillan updated MADLIB-965:
-----------------------------------
    Summary: RF and DT should accept array input for feature vector  (was: Error message unclear when running Random Forest 'forest_train' function)

> RF and DT should accept array input for feature vector
> ------------------------------------------------------
>
>                 Key: MADLIB-965
>                 URL: https://issues.apache.org/jira/browse/MADLIB-965
>             Project: Apache MADlib
>          Issue Type: Bug
>          Components: Module: Decision Tree, Module: Random Forest
>            Reporter: Rashmi Raghu
>            Priority: Minor
>             Fix For: v1.11
>
>         Attachments: DT and RF work1.ipynb
>
>
> We were testing whether the RF module can handle a single column containing an array of features as input (instead of each feature in a separate column). The call fails with an error, but the message does not make the source of the error clear (i.e., whether it is caused by the array feature column or by something else). An example table, query, and error can be found below:
> {quote}
> -- Executing query:
> DROP TABLE IF EXISTS dt_golf;
> CREATE TABLE dt_golf (
>     id integer NOT NULL,
>     "OUTLOOK" text,
>     temperature double precision,
>     humidity double precision,
>     windy text,
>     class text
> ) ;
> -- Executing query:
> INSERT INTO dt_golf (id,"OUTLOOK",temperature,humidity,windy,class) VALUES
> (1, 'sunny', 85, 85, 'false', 'Don''t Play'),
> (2, 'sunny', 80, 90, 'true', 'Don''t Play'),
> (3, 'overcast', 83, 78, 'false', 'Play'),
> (4, 'rain', 70, 96, 'false', 'Play'),
> (5, 'rain', 68, 80, 'false', 'Play'),
> (6, 'rain', 65, 70, 'true', 'Don''t Play'),
> (7, 'overcast', 64, 65, 'true', 'Play'),
> (8, 'sunny', 72, 95, 'false', 'Don''t Play'),
> (9, 'sunny', 69, 70, 'false', 'Play'),
> (10, 'rain', 75, 80, 'false', 'Play'),
> (11, 'sunny', 75, 70, 'true', 'Play'),
> (12, 'overcast', 72, 90, 'true', 'Play'),
> (13, 'overcast', 81, 75, 'false', 'Play'),
> (14, 'rain', 71, 80, 'true', 'Don''t Play');
> DROP TABLE IF EXISTS dt_golf_array;
> CREATE TABLE dt_golf_array as 
>     select id, array[temperature, humidity] as input_array, class
>     from dt_golf
> distributed by (id);
> DROP TABLE IF EXISTS train_output, train_output_group, train_output_summary;
> SELECT madlib.forest_train('dt_golf_array',         -- source table
>                            'train_output',    -- output model table
>                            'id',              -- id column
>                            'class',           -- response
>                            'input_array',   -- features
>                            NULL,              -- exclude columns
>                            NULL,              -- grouping columns
>                            20::integer,       -- number of trees
>                            1::integer,        -- number of random features
>                            TRUE::boolean,     -- variable importance
>                            1::integer,        -- num_permutations
>                            8::integer,        -- max depth
>                            3::integer,        -- min split
>                            1::integer,        -- min bucket
>                            10::integer        -- number of splits per continuous variable
>                            );
> NOTICE:  Table doesn't have 'DISTRIBUTED BY' clause -- Using column named 'id' as the Greenplum Database data distribution key for this table.
> HINT:  The 'DISTRIBUTED BY' clause determines the distribution of data. Make sure column(s) chosen are the optimal data distribution key to minimize skew.
> query result with 1 row discarded.
> ERROR:  plpy.SPIError: invalid array length (plpython.c:4648)
> DETAIL:  array_of_bigint: Size should be in [1, 1e7], 0 given
> CONTEXT:  Traceback (most recent call last):
>   PL/Python function "forest_train", line 42, in <module>
>     sample_ratio
>   PL/Python function "forest_train", line 589, in forest_train
>   PL/Python function "forest_train", line 1037, in _calculate_oob_prediction
> PL/Python function "forest_train"
> ********** Error **********
> ERROR: plpy.SPIError: invalid array length (plpython.c:4648)
> SQL state: XX000
> Detail: array_of_bigint: Size should be in [1, 1e7], 0 given
> Context: Traceback (most recent call last):
>   PL/Python function "forest_train", line 42, in <module>
>     sample_ratio
>   PL/Python function "forest_train", line 589, in forest_train
>   PL/Python function "forest_train", line 1037, in _calculate_oob_prediction
> PL/Python function "forest_train"
> {quote}
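
Until array input is supported, one possible workaround is to expand the array back into one column per feature and pass those column names as the feature list. The following is an untested sketch, not a confirmed fix: the table name dt_golf_expanded and the output table names train_output_expanded* are hypothetical, the array is assumed to always hold exactly two elements in the order [temperature, humidity] as built above, and the forest_train arguments are copied from the failing call with only the source table, output table, and feature list changed.

```sql
-- Hypothetical workaround: unpack the feature array into separate columns.
-- Assumes every input_array has exactly two elements: [temperature, humidity].
DROP TABLE IF EXISTS dt_golf_expanded;
CREATE TABLE dt_golf_expanded AS
    SELECT id,
           input_array[1] AS temperature,  -- SQL arrays are 1-indexed
           input_array[2] AS humidity,
           class
    FROM dt_golf_array
DISTRIBUTED BY (id);  -- Greenplum only; drop this clause on plain PostgreSQL

DROP TABLE IF EXISTS train_output_expanded,
                     train_output_expanded_group,
                     train_output_expanded_summary;
SELECT madlib.forest_train('dt_golf_expanded',        -- source table
                           'train_output_expanded',   -- output model table
                           'id',                      -- id column
                           'class',                   -- response
                           'temperature, humidity',   -- features, one column each
                           NULL,                      -- exclude columns
                           NULL,                      -- grouping columns
                           20::integer,               -- number of trees
                           1::integer,                -- number of random features
                           TRUE::boolean,             -- variable importance
                           1::integer,                -- num_permutations
                           8::integer,                -- max depth
                           3::integer,                -- min split
                           1::integer,                -- min bucket
                           10::integer                -- number of splits per continuous variable
                           );
```

If this call succeeds where the array version fails, that would point to the array feature column as the source of the "invalid array length" error rather than some other argument.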



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)