Posted to issues@madlib.apache.org by "Frank McQuillan (JIRA)" <ji...@apache.org> on 2018/06/03 17:45:00 UTC

[jira] [Closed] (MADLIB-1236) DT: tree_predict fails if a categorical variable has been discarded

     [ https://issues.apache.org/jira/browse/MADLIB-1236?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Frank McQuillan closed MADLIB-1236.
-----------------------------------

> DT: tree_predict fails if a categorical variable has been discarded
> -------------------------------------------------------------------
>
>                 Key: MADLIB-1236
>                 URL: https://issues.apache.org/jira/browse/MADLIB-1236
>             Project: Apache MADlib
>          Issue Type: Bug
>          Components: Module: Decision Tree
>            Reporter: Rahul Iyer
>            Assignee: Rahul Iyer
>            Priority: Major
>             Fix For: v1.15
>
>
> {{tree_predict}} fails if {{tree_train}} removed a categorical variable (possibly because it has only a single level). The summary table does not exclude the discarded categorical variable, so {{tree_predict}} tries to map the levels of that variable using a pre-built map. This mapping fails because {{tree_train}} does not include the variable in that pre-built map.
> Repro steps and the resulting error are given below.
> {code}
> DROP TABLE IF EXISTS dt_golf CASCADE;
> CREATE TABLE dt_golf (
>     id integer NOT NULL,
>     "OUTLOOK" text,
>     temperature double precision,
>     humidity double precision,
>     "Cont_features" double precision[],
>     cat_features text[],
>     windy boolean,
>     class text
> ) ;
> INSERT INTO dt_golf (id,"OUTLOOK",temperature,humidity,"Cont_features",cat_features, windy,class) VALUES
> (6, 'rain', NULL, 70, ARRAY[65, 70], ARRAY['a', 'b'], true, 'Don''t Play'),
> (16, 'overcast', 80, 75, ARRAY[81, 75], ARRAY['a', 'd'], false, 'Play'),
> (17, 'overcast', 60, 75, ARRAY[81, 75], ARRAY['a', 'd'], false, 'Play'),
> (18, 'overcast', 70, 75, ARRAY[81, 75], ARRAY['a', 'd'], false, 'Play');
> SELECT tree_train('dt_golf'::text,         -- source table
>                           'train_output'::text,    -- output model table
>                           'id'::text,              -- id column
>                           'temperature::double precision'::text,           -- response
>                           '"OUTLOOK", humidity, windy, cat_features'::text,   -- features
>                           NULL::text,        -- exclude columns
>                           'gini'::text,      -- split criterion
>                           'class'::text,     -- grouping
>                           NULL::text,        -- no weights
>                           10::integer,       -- max depth
>                           6::integer,        -- min split
>                           2::integer,        -- min bucket
>                           3::integer,        -- number of bins per continuous variable
>                           'cp=0.01'          -- cost-complexity pruning parameter
>                           );
> CREATE TABLE dt_golf2 as
> SELECT * FROM dt_golf
> UNION
> SELECT 15 as id, 'humid' as "OUTLOOK", 71 as temperature, 80 as humidity,
>        ARRAY[90, 90] as "Cont_features", ARRAY['b', 'c'] as cat_features,
>        true as windy, 'Don''t Play' as class;
> SELECT tree_predict('train_output', 'dt_golf2', 'predict_output');
> {code}
> Error message: 
> {code}
> psql:/tmp/madlib.88brFX/recursive_partitioning/test/decision_tree.sql_in.tmp:327: ERROR:  plpy.SPIError:           Function "_map_catlevel_to_int(text[],text[],integer[],boolean)": Invalid type conversion. Null where not expected.  (seg0 slice2 127.0.0.1:25432 pid=88213)
> CONTEXT:  Traceback (most recent call last):
>   PL/Python function "tree_predict", line 23, in <module>
>     return decision_tree.tree_predict(**globals())
>   PL/Python function "tree_predict", line 1690, in tree_predict
> PL/Python function "tree_predict"
> {code}
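
The mismatch described in the issue can be illustrated with a small standalone Python sketch. This is not MADlib's code: `build_level_map`, `encode_row`, and the toy rows are hypothetical names and data chosen for illustration, and a `ValueError` stands in for the NULL conversion error raised by `_map_catlevel_to_int`. The sketch assumes that training drops any categorical feature with a single level and that prediction encodes every feature named in the summary.

```python
from typing import Dict, List, Optional


def build_level_map(rows: List[dict], cat_features: List[str]) -> Dict[str, list]:
    """Collect the levels of each categorical feature seen in training.

    Assumption: a feature with a single distinct level is discarded,
    so it never gets an entry in the map.
    """
    level_map = {}
    for feat in cat_features:
        levels = sorted({row[feat] for row in rows if row[feat] is not None})
        if len(levels) > 1:
            level_map[feat] = levels
    return level_map


def encode_row(row: dict, features: List[str],
               level_map: Dict[str, list]) -> List[Optional[int]]:
    """Encode each categorical value as its index in the level map.

    A level not seen in training encodes as None; a feature with no map
    entry at all is an error (the analogue of the failure in the issue).
    """
    encoded = []
    for feat in features:
        levels = level_map.get(feat)
        if levels is None:
            raise ValueError(f"no level map for categorical feature {feat!r}")
        value = row[feat]
        encoded.append(levels.index(value) if value in levels else None)
    return encoded


# Toy training data: "windy" has a single level, so it is discarded.
training_rows = [
    {"outlook": "rain", "windy": False},
    {"outlook": "overcast", "windy": False},
    {"outlook": "overcast", "windy": False},
]
cat_features = ["outlook", "windy"]
level_map = build_level_map(training_rows, cat_features)

# Buggy summary: still lists every categorical feature passed to training.
summary_buggy = list(cat_features)
# Corrected summary: lists only the features that kept a level map.
summary_fixed = [feat for feat in cat_features if feat in level_map]

new_row = {"outlook": "rain", "windy": True}
```

With these definitions, `encode_row(new_row, summary_fixed, level_map)` succeeds, while `encode_row(new_row, summary_buggy, level_map)` raises because `windy` has no entry in the map. Keeping the summary's feature list in sync with the level map is the behavior the issue description implies.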



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)