You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@madlib.apache.org by iyerr3 <gi...@git.apache.org> on 2018/07/26 23:50:13 UTC
[GitHub] madlib pull request #301: DT/RF: Don't eliminate single-level categorical va...
GitHub user iyerr3 opened a pull request:
https://github.com/apache/madlib/pull/301
DT/RF: Don't eliminate single-level categorical variable
JIRA: MADLIB-1258
When DT/RF is run with grouping, a subset of the groups could eliminate
a categorical variable leading to multiple issues downstream, including
invalid importance values and incorrect prediction.
This commit keeps all categorical variables (even if it contains just
one level). This would lead to some inefficiency during tree train,
since the accumulator state would use additional space for this
categorical variable but never use it in a tree. This inefficiency is
still preferred for clean code and error-free prediction/importance
reporting.
Closes #301
Co-authored-by: Nandish Jayaram <nj...@apache.org>
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/madlib/madlib bugfix/dt_retain_cat_features
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/madlib/pull/301.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #301
----
commit 089a4e2162a7b92dd288e5518ca8710f0aeac696
Author: Rahul Iyer <ri...@...>
Date: 2018-07-26T19:17:58Z
DT/RF: Don't eliminate single-level categorical variable
JIRA: MADLIB-1258
When DT/RF is run with grouping, a subset of the groups could eliminate
a categorical variable leading to multiple issues downstream, including
invalid importance values and incorrect prediction.
This commit keeps all categorical variables (even if it contains just
one level). This would lead to some inefficiency during tree train,
since the accumulator state would use additional space for this
categorical variable but never use it in a tree. This inefficiency is
still preferred for clean code and error-free prediction/importance
reporting.
Closes #301
Co-authored-by: Nandish Jayaram <nj...@apache.org>
----
---
[GitHub] madlib issue #301: DT/RF: Don't eliminate single-level categorical variable
Posted by asfgit <gi...@git.apache.org>.
Github user asfgit commented on the issue:
https://github.com/apache/madlib/pull/301
Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/madlib-pr-build/612/
---
[GitHub] madlib pull request #301: DT/RF: Don't eliminate single-level categorical va...
Posted by jingyimei <gi...@git.apache.org>.
Github user jingyimei commented on a diff in the pull request:
https://github.com/apache/madlib/pull/301#discussion_r205911065
--- Diff: src/ports/postgres/modules/recursive_partitioning/test/decision_tree.sql_in ---
@@ -273,7 +274,7 @@ SELECT tree_train('dt_golf'::text, -- source table
'train_output'::text, -- output model table
'id'::text, -- id column
'temperature::double precision'::text, -- response
- 'humidity, windy, "Cont_features"'::text, -- features
+ 'humidity, windy2, "Cont_features"'::text, -- features
--- End diff --
we can leave a comment here explaining why we use windy2 with all valid value 'false'
---
[GitHub] madlib issue #301: DT/RF: Don't eliminate single-level categorical variable
Posted by asfgit <gi...@git.apache.org>.
Github user asfgit commented on the issue:
https://github.com/apache/madlib/pull/301
Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/madlib-pr-build/625/
---
[GitHub] madlib pull request #301: DT/RF: Don't eliminate single-level categorical va...
Posted by jingyimei <gi...@git.apache.org>.
Github user jingyimei commented on a diff in the pull request:
https://github.com/apache/madlib/pull/301#discussion_r205910952
--- Diff: src/ports/postgres/modules/recursive_partitioning/test/decision_tree.sql_in ---
@@ -282,13 +283,16 @@ SELECT tree_train('dt_golf'::text, -- source table
6::integer, -- min split
2::integer, -- min bucket
3::integer, -- number of bins per continuous variable
- 'cp=0.01, n_folds=2' -- cost-complexity pruning parameter
+ 'cp=0.01, n_folds=2', -- cost-complexity pruning parameter
+ 'null_as_category=True'
);
SELECT _print_decision_tree(tree) from train_output;
SELECT tree_display('train_output', False);
SELECT impurity_var_importance FROM train_output;
+SELECT cat_levels_in_text, cat_n_levels, impurity_var_importance FROM train_output;
--- End diff --
Is it better to assert cat_n_levels == {1} instead of just quering?
---
[GitHub] madlib pull request #301: DT/RF: Don't eliminate single-level categorical va...
Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:
https://github.com/apache/madlib/pull/301
---
[GitHub] madlib issue #301: DT/RF: Don't eliminate single-level categorical variable
Posted by asfgit <gi...@git.apache.org>.
Github user asfgit commented on the issue:
https://github.com/apache/madlib/pull/301
Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/madlib-pr-build/620/
---
[GitHub] madlib issue #301: DT/RF: Don't eliminate single-level categorical variable
Posted by asfgit <gi...@git.apache.org>.
Github user asfgit commented on the issue:
https://github.com/apache/madlib/pull/301
Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/madlib-pr-build/623/
---
[GitHub] madlib issue #301: DT/RF: Don't eliminate single-level categorical variable
Posted by asfgit <gi...@git.apache.org>.
Github user asfgit commented on the issue:
https://github.com/apache/madlib/pull/301
Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/madlib-pr-build/613/
---
[GitHub] madlib issue #301: DT/RF: Don't eliminate single-level categorical variable
Posted by asfgit <gi...@git.apache.org>.
Github user asfgit commented on the issue:
https://github.com/apache/madlib/pull/301
Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/madlib-pr-build/626/
---
[GitHub] madlib issue #301: DT/RF: Don't eliminate single-level categorical variable
Posted by asfgit <gi...@git.apache.org>.
Github user asfgit commented on the issue:
https://github.com/apache/madlib/pull/301
Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/madlib-pr-build/615/
---
[GitHub] madlib issue #301: DT/RF: Don't eliminate single-level categorical variable
Posted by asfgit <gi...@git.apache.org>.
Github user asfgit commented on the issue:
https://github.com/apache/madlib/pull/301
Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/madlib-pr-build/611/
---
[GitHub] madlib issue #301: DT/RF: Don't eliminate single-level categorical variable
Posted by asfgit <gi...@git.apache.org>.
Github user asfgit commented on the issue:
https://github.com/apache/madlib/pull/301
Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/madlib-pr-build/629/
---
[GitHub] madlib issue #301: DT/RF: Don't eliminate single-level categorical variable
Posted by asfgit <gi...@git.apache.org>.
Github user asfgit commented on the issue:
https://github.com/apache/madlib/pull/301
Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/madlib-pr-build/621/
---
[GitHub] madlib issue #301: DT/RF: Don't eliminate single-level categorical variable
Posted by asfgit <gi...@git.apache.org>.
Github user asfgit commented on the issue:
https://github.com/apache/madlib/pull/301
Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/madlib-pr-build/627/
---
[GitHub] madlib issue #301: DT/RF: Don't eliminate single-level categorical variable
Posted by asfgit <gi...@git.apache.org>.
Github user asfgit commented on the issue:
https://github.com/apache/madlib/pull/301
Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/madlib-pr-build/616/
---