You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@madlib.apache.org by iyerr3 <gi...@git.apache.org> on 2018/07/26 23:50:13 UTC

[GitHub] madlib pull request #301: DT/RF: Don't eliminate single-level categorical va...

GitHub user iyerr3 opened a pull request:

    https://github.com/apache/madlib/pull/301

    DT/RF: Don't eliminate single-level categorical variable

    JIRA: MADLIB-1258
    
    When DT/RF is run with grouping, a subset of the groups could eliminate
    a categorical variable leading to multiple issues downstream, including
    invalid importance values and incorrect prediction.
    
    This commit keeps all categorical variables (even if it contains just
    one level). This would lead to some inefficiency during tree train,
    since the accumulator state would use additional space for this
    categorical variable but never use it in a tree. This inefficiency is
    still preferred for clean code and error-free prediction/importance
    reporting.
    
    Closes #301
    
    Co-authored-by: Nandish Jayaram <nj...@apache.org>

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/madlib/madlib bugfix/dt_retain_cat_features

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/madlib/pull/301.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #301
    
----
commit 089a4e2162a7b92dd288e5518ca8710f0aeac696
Author: Rahul Iyer <ri...@...>
Date:   2018-07-26T19:17:58Z

    DT/RF: Don't eliminate single-level categorical variable
    
    JIRA: MADLIB-1258
    
    When DT/RF is run with grouping, a subset of the groups could eliminate
    a categorical variable leading to multiple issues downstream, including
    invalid importance values and incorrect prediction.
    
    This commit keeps all categorical variables (even if it contains just
    one level). This would lead to some inefficiency during tree train,
    since the accumulator state would use additional space for this
    categorical variable but never use it in a tree. This inefficiency is
    still preferred for clean code and error-free prediction/importance
    reporting.
    
    Closes #301
    
    Co-authored-by: Nandish Jayaram <nj...@apache.org>

----


---

[GitHub] madlib issue #301: DT/RF: Don't eliminate single-level categorical variable

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit commented on the issue:

    https://github.com/apache/madlib/pull/301
  
    
    Refer to this link for build results (access rights to CI server needed): 
    https://builds.apache.org/job/madlib-pr-build/612/



---

[GitHub] madlib pull request #301: DT/RF: Don't eliminate single-level categorical va...

Posted by jingyimei <gi...@git.apache.org>.
Github user jingyimei commented on a diff in the pull request:

    https://github.com/apache/madlib/pull/301#discussion_r205911065
  
    --- Diff: src/ports/postgres/modules/recursive_partitioning/test/decision_tree.sql_in ---
    @@ -273,7 +274,7 @@ SELECT tree_train('dt_golf'::text,         -- source table
                              'train_output'::text,    -- output model table
                              'id'::text,              -- id column
                              'temperature::double precision'::text,           -- response
    -                         'humidity, windy, "Cont_features"'::text,   -- features
    +                         'humidity, windy2, "Cont_features"'::text,   -- features
    --- End diff --
    
    we can leave a comment here explaining why we use windy2 with all valid value 'false'


---

[GitHub] madlib issue #301: DT/RF: Don't eliminate single-level categorical variable

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit commented on the issue:

    https://github.com/apache/madlib/pull/301
  
    
    Refer to this link for build results (access rights to CI server needed): 
    https://builds.apache.org/job/madlib-pr-build/625/



---

[GitHub] madlib pull request #301: DT/RF: Don't eliminate single-level categorical va...

Posted by jingyimei <gi...@git.apache.org>.
Github user jingyimei commented on a diff in the pull request:

    https://github.com/apache/madlib/pull/301#discussion_r205910952
  
    --- Diff: src/ports/postgres/modules/recursive_partitioning/test/decision_tree.sql_in ---
    @@ -282,13 +283,16 @@ SELECT tree_train('dt_golf'::text,         -- source table
                              6::integer,        -- min split
                              2::integer,        -- min bucket
                              3::integer,        -- number of bins per continuous variable
    -                         'cp=0.01, n_folds=2'          -- cost-complexity pruning parameter
    +                         'cp=0.01, n_folds=2',          -- cost-complexity pruning parameter
    +                         'null_as_category=True'
                              );
     
     SELECT _print_decision_tree(tree) from train_output;
     SELECT tree_display('train_output', False);
     SELECT impurity_var_importance FROM train_output;
    +SELECT cat_levels_in_text, cat_n_levels, impurity_var_importance FROM train_output;
    --- End diff --
    
    Is it better to assert cat_n_levels == {1} instead of just quering?


---

[GitHub] madlib pull request #301: DT/RF: Don't eliminate single-level categorical va...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/madlib/pull/301


---

[GitHub] madlib issue #301: DT/RF: Don't eliminate single-level categorical variable

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit commented on the issue:

    https://github.com/apache/madlib/pull/301
  
    
    Refer to this link for build results (access rights to CI server needed): 
    https://builds.apache.org/job/madlib-pr-build/620/



---

[GitHub] madlib issue #301: DT/RF: Don't eliminate single-level categorical variable

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit commented on the issue:

    https://github.com/apache/madlib/pull/301
  
    
    Refer to this link for build results (access rights to CI server needed): 
    https://builds.apache.org/job/madlib-pr-build/623/



---

[GitHub] madlib issue #301: DT/RF: Don't eliminate single-level categorical variable

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit commented on the issue:

    https://github.com/apache/madlib/pull/301
  
    
    Refer to this link for build results (access rights to CI server needed): 
    https://builds.apache.org/job/madlib-pr-build/613/



---

[GitHub] madlib issue #301: DT/RF: Don't eliminate single-level categorical variable

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit commented on the issue:

    https://github.com/apache/madlib/pull/301
  
    
    Refer to this link for build results (access rights to CI server needed): 
    https://builds.apache.org/job/madlib-pr-build/626/



---

[GitHub] madlib issue #301: DT/RF: Don't eliminate single-level categorical variable

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit commented on the issue:

    https://github.com/apache/madlib/pull/301
  
    
    Refer to this link for build results (access rights to CI server needed): 
    https://builds.apache.org/job/madlib-pr-build/615/



---

[GitHub] madlib issue #301: DT/RF: Don't eliminate single-level categorical variable

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit commented on the issue:

    https://github.com/apache/madlib/pull/301
  
    
    Refer to this link for build results (access rights to CI server needed): 
    https://builds.apache.org/job/madlib-pr-build/611/



---

[GitHub] madlib issue #301: DT/RF: Don't eliminate single-level categorical variable

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit commented on the issue:

    https://github.com/apache/madlib/pull/301
  
    
    Refer to this link for build results (access rights to CI server needed): 
    https://builds.apache.org/job/madlib-pr-build/629/



---

[GitHub] madlib issue #301: DT/RF: Don't eliminate single-level categorical variable

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit commented on the issue:

    https://github.com/apache/madlib/pull/301
  
    
    Refer to this link for build results (access rights to CI server needed): 
    https://builds.apache.org/job/madlib-pr-build/621/



---

[GitHub] madlib issue #301: DT/RF: Don't eliminate single-level categorical variable

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit commented on the issue:

    https://github.com/apache/madlib/pull/301
  
    
    Refer to this link for build results (access rights to CI server needed): 
    https://builds.apache.org/job/madlib-pr-build/627/



---

[GitHub] madlib issue #301: DT/RF: Don't eliminate single-level categorical variable

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit commented on the issue:

    https://github.com/apache/madlib/pull/301
  
    
    Refer to this link for build results (access rights to CI server needed): 
    https://builds.apache.org/job/madlib-pr-build/616/



---