Posted to issues@madlib.apache.org by "Frank McQuillan (JIRA)" <ji...@apache.org> on 2018/03/22 18:02:00 UTC
[jira] [Updated] (MADLIB-1219) RF: null_as_category=TRUE issues
[ https://issues.apache.org/jira/browse/MADLIB-1219?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Frank McQuillan updated MADLIB-1219:
------------------------------------
Summary: RF: null_as_category=TRUE issues (was: RF: null_as_category=TRUE not working when variable importance used)
> RF: null_as_category=TRUE issues
> ---------------------------------
>
> Key: MADLIB-1219
> URL: https://issues.apache.org/jira/browse/MADLIB-1219
> Project: Apache MADlib
> Issue Type: Bug
> Components: Module: Random Forest
> Reporter: Frank McQuillan
> Assignee: Rahul Iyer
> Priority: Major
> Fix For: v1.14
>
>
> (1)
> I cannot get null_as_category=TRUE to work when variable importance is used:
> {code}
> DROP TABLE IF EXISTS null_handling_example;
> CREATE TABLE null_handling_example (
> id integer,
> country text,
> city text,
> weather text,
> response text
> );
> INSERT INTO null_handling_example VALUES
> (1,null,null,null,'a'),
> (2,'US',null,null,'b'),
> (3,'US','NY',null,'c'),
> (4,'US','NY','rainy','d');
> {code}
> RF:
> {code}
> DROP TABLE IF EXISTS train_output, train_output_group, train_output_summary;
> SELECT madlib.forest_train('null_handling_example',  -- source table
>                            'train_output',           -- output model table
>                            'id',                     -- id column
>                            'response',               -- response
>                            'country, weather, city', -- features
>                            NULL,                     -- exclude columns
>                            NULL,                     -- grouping columns
>                            2::integer,               -- number of trees
>                            2::integer,               -- number of random features
>                            TRUE::boolean,            -- variable importance
>                            1::integer,               -- num_permutations
>                            3::integer,               -- max depth
>                            2::integer,               -- min split
>                            2::integer,               -- min bucket
>                            2::integer,               -- number of splits per continuous variable
>                            'null_as_category=TRUE'   -- null handling params
>                            );
> {code}
> produces this error:
> {code}
> ERROR: plpy.SPIError: invalid array length
> DETAIL: array_of_float: Size should be in [1, 1e7], 0 given
> CONTEXT: Traceback (most recent call last):
> PL/Python function "forest_train", line 42, in <module>
> sample_ratio
> PL/Python function "forest_train", line 609, in forest_train
> PL/Python function "forest_train", line 1058, in _calculate_oob_prediction
> PL/Python function "forest_train"
> {code}
> When variable importance is FALSE, it does not produce this error.
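> For reference, a minimal sketch of the variant that does not raise the error: the same call with only the variable importance flag changed to FALSE. All other arguments are copied unchanged from the call above.
> {code}
> DROP TABLE IF EXISTS train_output, train_output_group, train_output_summary;
> SELECT madlib.forest_train('null_handling_example',  -- source table
>                            'train_output',           -- output model table
>                            'id',                     -- id column
>                            'response',               -- response
>                            'country, weather, city', -- features
>                            NULL,                     -- exclude columns
>                            NULL,                     -- grouping columns
>                            2::integer,               -- number of trees
>                            2::integer,               -- number of random features
>                            FALSE::boolean,           -- variable importance disabled (only change)
>                            1::integer,               -- num_permutations
>                            3::integer,               -- max depth
>                            2::integer,               -- min split
>                            2::integer,               -- min bucket
>                            2::integer,               -- number of splits per continuous variable
>                            'null_as_category=TRUE'   -- null handling params
>                            );
> {code}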
> (2)
> Is null_as_category working for RF?
> If I do get a forest trained, the predictions seem wrong:
> {code}
> DROP TABLE IF EXISTS table_test;
> CREATE TABLE table_test (
> id integer,
> country text,
> city text,
> weather text,
> expected_response text
> );
> INSERT INTO table_test VALUES
> (1,'IN','MUM','cloudy','a'),
> (2,'US','HOU','humid','b'),
> (3,'US','NY','sunny','c'),
> (4,'US','NY','rainy','d');
> DROP TABLE IF EXISTS prediction_results;
> SELECT madlib.forest_predict('train_output',
>                              'table_test',
>                              'prediction_results',
>                              'response');
> SELECT s.id, expected_response, estimated_response
> FROM prediction_results p, table_test s
> WHERE s.id = p.id ORDER BY id;
> {code}
> produces
> {code}
>  id | expected_response | estimated_response
> ----+-------------------+--------------------
>   1 | a                 | a
>   2 | b                 | b
>   3 | c                 | a
>   4 | d                 | d
> (4 rows)
> {code}
> but the same example run with decision tree predicts correctly.
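> A sketch of what the decision tree comparison might look like. The exact call used is not recorded in this issue, so the split criterion, pruning setting, depth and bucket values, and the dt_* table names below are assumptions chosen to mirror the RF call:
> {code}
> DROP TABLE IF EXISTS dt_output, dt_output_summary;
> SELECT madlib.tree_train('null_handling_example',  -- source table
>                          'dt_output',              -- output model table (hypothetical name)
>                          'id',                     -- id column
>                          'response',               -- response
>                          'country, weather, city', -- features
>                          NULL,                     -- exclude columns
>                          'gini',                   -- split criterion (assumed)
>                          NULL::text,               -- no grouping
>                          NULL::text,               -- no weights
>                          3,                        -- max depth (mirrors RF call)
>                          2,                        -- min split (mirrors RF call)
>                          2,                        -- min bucket (mirrors RF call)
>                          2,                        -- number of splits per continuous variable
>                          'cp=0',                   -- no pruning (assumed)
>                          'null_as_category=TRUE'   -- null handling params
>                          );
> DROP TABLE IF EXISTS dt_prediction_results;
> SELECT madlib.tree_predict('dt_output',
>                            'table_test',
>                            'dt_prediction_results',
>                            'response');
> SELECT s.id, expected_response, estimated_response
> FROM dt_prediction_results p, table_test s
> WHERE s.id = p.id ORDER BY id;
> {code}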
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)