You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@madlib.apache.org by iyerr3 <gi...@git.apache.org> on 2018/06/08 01:11:33 UTC
[GitHub] madlib pull request #277: DT: Add impurity importance metric
GitHub user iyerr3 opened a pull request:
https://github.com/apache/madlib/pull/277
DT: Add impurity importance metric
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/madlib/madlib feature/dt-gini-importance
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/madlib/pull/277.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #277
----
commit 08f4df2ec09a6726887b188ada892e362f85c3e1
Author: Nandish Jayaram <nj...@...>
Date: 2018-05-30T01:09:18Z
wip
Co-authored-by: Rahul Iyer <ri...@apache.org>
commit 4a87ba2a7cfc0a453a83fbf71356c5dfcc5f6256
Author: Nandish Jayaram <nj...@...>
Date: 2018-05-30T19:32:32Z
DT is in working condition. Yet to check correctness and add tests.
commit 64eb0014db940585f399c82d4ed514ca1a9cd4e8
Author: Nandish Jayaram <nj...@...>
Date: 2018-05-30T22:46:43Z
wip, there is an ic failure to be fixed.
commit c13b89ffdaeae4df902e4db8f86ccbf0fcbca945
Author: Rahul Iyer <ri...@...>
Date: 2018-05-31T23:53:06Z
--wip--
commit 16215182e69cd4aeaad7d357244e0ad481c6ac9f
Author: Rahul Iyer <ri...@...>
Date: 2018-06-05T19:21:09Z
Move variable importance computation to after training
commit d536c03ceee6daedd3031a06aa540508722e94ce
Author: Rahul Iyer <ri...@...>
Date: 2018-06-08T00:25:48Z
Add surrogate computation to importance + other changes
----
---
[GitHub] madlib issue #277: DT: Add impurity importance metric
Posted by asfgit <gi...@git.apache.org>.
Github user asfgit commented on the issue:
https://github.com/apache/madlib/pull/277
Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/madlib-pr-build/523/
---
[GitHub] madlib issue #277: DT: Add impurity importance metric
Posted by asfgit <gi...@git.apache.org>.
Github user asfgit commented on the issue:
https://github.com/apache/madlib/pull/277
Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/madlib-pr-build/502/
---
[GitHub] madlib pull request #277: DT: Add impurity importance metric
Posted by iyerr3 <gi...@git.apache.org>.
Github user iyerr3 commented on a diff in the pull request:
https://github.com/apache/madlib/pull/277#discussion_r195277278
--- Diff: src/ports/postgres/modules/recursive_partitioning/decision_tree.py_in ---
@@ -1097,28 +1121,21 @@ def _one_step(schema_madlib, training_table_name, cat_features,
"$3", "$2",
null_proxy)
- # The arguments of the aggregate (in the same order):
- # 1. current tree state, madlib.bytea8
- # 2. categorical features (integer format) in a single array
- # 3. continuous features in a single array
- # 4. weight value
- # 5. categorical sorted levels (integer format) in a combined array
- # 6. continuous splits
- # 7. number of dependent levels
train_sql = """
SELECT (result).* from (
SELECT
- {schema_madlib}._dt_apply($1,
+ {schema_madlib}._dt_apply(
+ $1,
{schema_madlib}._compute_leaf_stats(
- $1,
- {cat_features_str},
- {con_features_str},
+ $1, -- current tree state, madlib.bytea8
+ {cat_features_str}, -- categorical features in an array
+ {con_features_str}, -- continuous features in an array
{dep_var},
- {weights},
- $2,
- $4,
- {dep_n_levels}::smallint,
- {subsample}::boolean
+ {weights}, -- weight value
+ $2, -- categorical sorted levels in a combined array
+ $4, -- continuous splits
+ {dep_n_levels}::smallint, -- number of dependent levels
+ {subsample}::boolean -- should we use a subsample of data
--- End diff --
The `$3` is part of the `cat_features_str`. I can put in a comment to that effect over here.
---
[GitHub] madlib issue #277: DT: Add impurity importance metric
Posted by asfgit <gi...@git.apache.org>.
Github user asfgit commented on the issue:
https://github.com/apache/madlib/pull/277
Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/madlib-pr-build/501/
---
[GitHub] madlib issue #277: DT: Add impurity importance metric
Posted by asfgit <gi...@git.apache.org>.
Github user asfgit commented on the issue:
https://github.com/apache/madlib/pull/277
Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/madlib-pr-build/512/
---
[GitHub] madlib pull request #277: DT: Add impurity importance metric
Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:
https://github.com/apache/madlib/pull/277
---
[GitHub] madlib issue #277: DT: Add impurity importance metric
Posted by asfgit <gi...@git.apache.org>.
Github user asfgit commented on the issue:
https://github.com/apache/madlib/pull/277
Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/madlib-pr-build/503/
---
[GitHub] madlib issue #277: DT: Add impurity importance metric
Posted by asfgit <gi...@git.apache.org>.
Github user asfgit commented on the issue:
https://github.com/apache/madlib/pull/277
Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/madlib-pr-build/505/
---
[GitHub] madlib issue #277: DT: Add impurity importance metric
Posted by asfgit <gi...@git.apache.org>.
Github user asfgit commented on the issue:
https://github.com/apache/madlib/pull/277
Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/madlib-pr-build/509/
---
[GitHub] madlib pull request #277: DT: Add impurity importance metric
Posted by njayaram2 <gi...@git.apache.org>.
Github user njayaram2 commented on a diff in the pull request:
https://github.com/apache/madlib/pull/277#discussion_r196150565
--- Diff: src/ports/postgres/modules/recursive_partitioning/decision_tree.py_in ---
@@ -1097,28 +1121,21 @@ def _one_step(schema_madlib, training_table_name, cat_features,
"$3", "$2",
null_proxy)
- # The arguments of the aggregate (in the same order):
- # 1. current tree state, madlib.bytea8
- # 2. categorical features (integer format) in a single array
- # 3. continuous features in a single array
- # 4. weight value
- # 5. categorical sorted levels (integer format) in a combined array
- # 6. continuous splits
- # 7. number of dependent levels
train_sql = """
SELECT (result).* from (
SELECT
- {schema_madlib}._dt_apply($1,
+ {schema_madlib}._dt_apply(
+ $1,
{schema_madlib}._compute_leaf_stats(
- $1,
- {cat_features_str},
- {con_features_str},
+ $1, -- current tree state, madlib.bytea8
+ {cat_features_str}, -- categorical features in an array
+ {con_features_str}, -- continuous features in an array
{dep_var},
- {weights},
- $2,
- $4,
- {dep_n_levels}::smallint,
- {subsample}::boolean
+ {weights}, -- weight value
+ $2, -- categorical sorted levels in a combined array
+ $4, -- continuous splits
+ {dep_n_levels}::smallint, -- number of dependent levels
+ {subsample}::boolean -- should we use a subsample of data
--- End diff --
Oh okay, thank you. I think a comment will be useful.
---
[GitHub] madlib issue #277: DT: Add impurity importance metric
Posted by asfgit <gi...@git.apache.org>.
Github user asfgit commented on the issue:
https://github.com/apache/madlib/pull/277
Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/madlib-pr-build/508/
---
[GitHub] madlib issue #277: DT: Add impurity importance metric
Posted by asfgit <gi...@git.apache.org>.
Github user asfgit commented on the issue:
https://github.com/apache/madlib/pull/277
Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/madlib-pr-build/511/
---
[GitHub] madlib pull request #277: DT: Add impurity importance metric
Posted by njayaram2 <gi...@git.apache.org>.
Github user njayaram2 commented on a diff in the pull request:
https://github.com/apache/madlib/pull/277#discussion_r194918142
--- Diff: src/ports/postgres/modules/recursive_partitioning/decision_tree.py_in ---
@@ -1097,28 +1121,21 @@ def _one_step(schema_madlib, training_table_name, cat_features,
"$3", "$2",
null_proxy)
- # The arguments of the aggregate (in the same order):
- # 1. current tree state, madlib.bytea8
- # 2. categorical features (integer format) in a single array
- # 3. continuous features in a single array
- # 4. weight value
- # 5. categorical sorted levels (integer format) in a combined array
- # 6. continuous splits
- # 7. number of dependent levels
train_sql = """
SELECT (result).* from (
SELECT
- {schema_madlib}._dt_apply($1,
+ {schema_madlib}._dt_apply(
+ $1,
{schema_madlib}._compute_leaf_stats(
- $1,
- {cat_features_str},
- {con_features_str},
+ $1, -- current tree state, madlib.bytea8
+ {cat_features_str}, -- categorical features in an array
+ {con_features_str}, -- continuous features in an array
{dep_var},
- {weights},
- $2,
- $4,
- {dep_n_levels}::smallint,
- {subsample}::boolean
+ {weights}, -- weight value
+ $2, -- categorical sorted levels in a combined array
+ $4, -- continuous splits
+ {dep_n_levels}::smallint, -- number of dependent levels
+ {subsample}::boolean -- should we use a subsample of data
--- End diff --
We only use `$1, $2, and $4` in this query. Can we remove the entry associated with `$3` (guess it refers to `bins[cat_origin]`) from prepare and execute statements that follow?
---