Posted to issues@madlib.apache.org by "Frank McQuillan (JIRA)" <ji...@apache.org> on 2018/03/22 18:02:00 UTC

[jira] [Updated] (MADLIB-1219) RF: null_as_category=TRUE issues

     [ https://issues.apache.org/jira/browse/MADLIB-1219?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Frank McQuillan updated MADLIB-1219:
------------------------------------
    Summary: RF:  null_as_category=TRUE issues  (was: RF:  null_as_category=TRUE not working when variable importance used)

> RF:  null_as_category=TRUE issues
> ---------------------------------
>
>                 Key: MADLIB-1219
>                 URL: https://issues.apache.org/jira/browse/MADLIB-1219
>             Project: Apache MADlib
>          Issue Type: Bug
>          Components: Module: Random Forest
>            Reporter: Frank McQuillan
>            Assignee: Rahul Iyer
>            Priority: Major
>             Fix For: v1.14
>
>
> (1)
> I cannot get null_as_category=TRUE to work when variable importance is used:
> {code}
> DROP TABLE IF EXISTS null_handling_example;
> CREATE TABLE null_handling_example (
>     id integer,
>     country text,
>     city text,
>     weather text,
>     response text
> );
> INSERT INTO null_handling_example VALUES
> (1,null,null,null,'a'),
> (2,'US',null,null,'b'),
> (3,'US','NY',null,'c'),
> (4,'US','NY','rainy','d');
> {code}
> RF:
> {code}
> DROP TABLE IF EXISTS train_output, train_output_group, train_output_summary;
> SELECT madlib.forest_train('null_handling_example',  -- source table
>                            'train_output',    -- output model table
>                            'id',              -- id column
>                            'response',        -- response
>                            'country, weather, city',   -- features
>                            NULL,              -- exclude columns
>                            NULL,              -- grouping columns
>                            2::integer,        -- number of trees
>                            2::integer,        -- number of random features
>                            TRUE::boolean,     -- variable importance
>                            1::integer,        -- num_permutations
>                            3::integer,        -- max depth
>                            2::integer,        -- min split
>                            2::integer,        -- min bucket
>                            2::integer,        -- number of splits per continuous variable
>                            'null_as_category=TRUE'  -- null handling params
>                            );
> {code}
> produces this error:
> {code}
> ERROR:  plpy.SPIError: invalid array length
> DETAIL:  array_of_float: Size should be in [1, 1e7], 0 given
> CONTEXT:  Traceback (most recent call last):
>   PL/Python function "forest_train", line 42, in <module>
>     sample_ratio
>   PL/Python function "forest_train", line 609, in forest_train
>   PL/Python function "forest_train", line 1058, in _calculate_oob_prediction
> PL/Python function "forest_train"
> {code}
> When variable importance is FALSE, it does not produce this error.
> (2)
> Is null_as_category working for RF?
> If I do get a forest trained, the predictions seem wrong:
> {code}
> DROP TABLE IF EXISTS table_test;
> CREATE TABLE table_test (
>     id integer,
>     country text,
>     city text,
>     weather text,
>     expected_response text
> );
> INSERT INTO table_test VALUES
> (1,'IN','MUM','cloudy','a'),
> (2,'US','HOU','humid','b'),
> (3,'US','NY','sunny','c'),
> (4,'US','NY','rainy','d');
> DROP TABLE IF EXISTS prediction_results;
> SELECT madlib.forest_predict('train_output',
>                              'table_test',
>                              'prediction_results',
>                              'response');
> SELECT s.id, expected_response, estimated_response
> FROM prediction_results p, table_test s
> WHERE s.id = p.id ORDER BY id;
> {code}
> produces
> {code}
>  id | expected_response | estimated_response 
> ----+-------------------+--------------------
>   1 | a                 | a
>   2 | b                 | a
>   3 | c                 | a
>   4 | d                 | d
> (4 rows)
> {code}
> The same example trained with a decision tree predicts correctly.


