Posted to issues@madlib.apache.org by "Nandish Jayaram (JIRA)" <ji...@apache.org> on 2018/06/05 21:22:00 UTC

[jira] [Commented] (MADLIB-925) Improve RF output format for variable importance

    [ https://issues.apache.org/jira/browse/MADLIB-925?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16502522#comment-16502522 ] 

Nandish Jayaram commented on MADLIB-925:
----------------------------------------

With Arvind Sridhar:
It might be a good idea to create a new helper function, applicable to both DT and RF,
that reports the various importance scores.
One suggestion for the interface and its output:
{code}
madlib.get_var_importance(
  tree/forest_model_table,
  output_table
)
{code}
Columns in output table <output_table>:
{code}
1. <...grouping_cols...>
2. feature
3. gini_importance::integer/float8
4. var_importance::float8 (applicable only to RF)
{code}

The column name gini_importance might change based on the outcome of another JIRA
(https://issues.apache.org/jira/browse/MADLIB-1205).
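
To make the proposed long-format output concrete, here is a rough Python sketch of the reshaping such a helper would need to do: turn the comma-separated feature strings and their parallel importance arrays into one row per feature. This is only an illustration, not the MADlib implementation. The function name flatten_importance, the continuous-feature inputs (con_features, con_importance), the output key var_importance, and the sample values are all assumptions made for the example.

```python
from typing import Dict, List, Optional, Sequence


def flatten_importance(
    cat_features: Optional[str],
    cat_importance: Optional[Sequence[float]],
    con_features: Optional[str],
    con_importance: Optional[Sequence[float]],
) -> List[Dict[str, object]]:
    """Return one {"feature", "var_importance"} row per variable.

    Hypothetical helper: the inputs mimic a comma-separated feature
    string plus a parallel array of scores, once for categorical and
    once for continuous features.
    """
    rows: List[Dict[str, object]] = []
    for names, scores in (
        (cat_features, cat_importance),
        (con_features, con_importance),
    ):
        # A missing or empty feature string means "no features of this kind".
        feature_names = [n.strip() for n in (names or "").split(",") if n.strip()]
        score_list = list(scores or [])
        if len(feature_names) != len(score_list):
            raise ValueError(
                "feature names and importance scores differ in length: "
                f"{len(feature_names)} vs {len(score_list)}"
            )
        for name, score in zip(feature_names, score_list):
            rows.append({"feature": name, "var_importance": score})
    # Most important variables first; ties keep their input order.
    rows.sort(key=lambda r: r["var_importance"], reverse=True)
    return rows


if __name__ == "__main__":
    # Made-up sample values, for illustration only.
    for row in flatten_importance(
        "outlook,windy", [0.5, 0.1], "temperature,humidity", [0.3, 0.7]
    ):
        print(row["feature"], row["var_importance"])
```

With a single table in this shape, the user no longer has to unnest the categorical and continuous arrays in separate queries. Grouping columns are left out of the sketch; a real helper would have to carry them through for each group.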


> Improve RF output format for variable importance
> ------------------------------------------------
>
>                 Key: MADLIB-925
>                 URL: https://issues.apache.org/jira/browse/MADLIB-925
>             Project: Apache MADlib
>          Issue Type: Improvement
>          Components: Module: Random Forest
>            Reporter: Frank McQuillan
>            Priority: Major
>              Labels: starter
>
> As a user,
> I want to have an easier way of accessing the variable importance output from random forest so that I can understand which are the most important variables.
> The current way to get the importance of each variable in tabular form (assuming the output table is named `rf_output`):
> ```
> SELECT unnest(regexp_split_to_array(cat_features, ',')) as variable, 
>    unnest(cat_var_importance) as importance 
> FROM rf_output_group, rf_output_summary;
> ```
> This query is cumbersome to write and has to be written twice: once for categorical features and once for continuous features.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)