You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@madlib.apache.org by iyerr3 <gi...@git.apache.org> on 2018/06/08 01:11:33 UTC

[GitHub] madlib pull request #277: DT: Add impurity importance metric

GitHub user iyerr3 opened a pull request:

    https://github.com/apache/madlib/pull/277

    DT: Add impurity importance metric

    

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/madlib/madlib feature/dt-gini-importance

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/madlib/pull/277.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #277
    
----
commit 08f4df2ec09a6726887b188ada892e362f85c3e1
Author: Nandish Jayaram <nj...@...>
Date:   2018-05-30T01:09:18Z

    wip
    
    Co-authored-by: Rahul Iyer <ri...@apache.org>

commit 4a87ba2a7cfc0a453a83fbf71356c5dfcc5f6256
Author: Nandish Jayaram <nj...@...>
Date:   2018-05-30T19:32:32Z

    DT is in working condition. Yet to check correctness and add tests.

commit 64eb0014db940585f399c82d4ed514ca1a9cd4e8
Author: Nandish Jayaram <nj...@...>
Date:   2018-05-30T22:46:43Z

    wip, there is an ic failure to be fixed.

commit c13b89ffdaeae4df902e4db8f86ccbf0fcbca945
Author: Rahul Iyer <ri...@...>
Date:   2018-05-31T23:53:06Z

    --wip--

commit 16215182e69cd4aeaad7d357244e0ad481c6ac9f
Author: Rahul Iyer <ri...@...>
Date:   2018-06-05T19:21:09Z

    Move variable importance computation to after training

commit d536c03ceee6daedd3031a06aa540508722e94ce
Author: Rahul Iyer <ri...@...>
Date:   2018-06-08T00:25:48Z

    Add surrogate computation to importance + other changes

----


---

[GitHub] madlib issue #277: DT: Add impurity importance metric

Posted by asfgit <gi...@git.apache.org>.

Github user asfgit commented on the issue:

    https://github.com/apache/madlib/pull/277
  
    
    Refer to this link for build results (access rights to CI server needed): 
    https://builds.apache.org/job/madlib-pr-build/523/



---

[GitHub] madlib issue #277: DT: Add impurity importance metric

Posted by asfgit <gi...@git.apache.org>.

Github user asfgit commented on the issue:

    https://github.com/apache/madlib/pull/277
  
    
    Refer to this link for build results (access rights to CI server needed): 
    https://builds.apache.org/job/madlib-pr-build/502/



---

[GitHub] madlib pull request #277: DT: Add impurity importance metric

Posted by iyerr3 <gi...@git.apache.org>.

Github user iyerr3 commented on a diff in the pull request:

    https://github.com/apache/madlib/pull/277#discussion_r195277278
  
    --- Diff: src/ports/postgres/modules/recursive_partitioning/decision_tree.py_in ---
    @@ -1097,28 +1121,21 @@ def _one_step(schema_madlib, training_table_name, cat_features,
                                                              "$3", "$2",
                                                              null_proxy)
     
    -    # The arguments of the aggregate (in the same order):
    -    # 1. current tree state, madlib.bytea8
    -    # 2. categorical features (integer format) in a single array
    -    # 3. continuous features in a single array
    -    # 4. weight value
    -    # 5. categorical sorted levels (integer format) in a combined array
    -    # 6. continuous splits
    -    # 7. number of dependent levels
         train_sql = """
             SELECT (result).* from (
                 SELECT
    -                {schema_madlib}._dt_apply($1,
    +                {schema_madlib}._dt_apply(
    +                    $1,
                         {schema_madlib}._compute_leaf_stats(
    -                        $1,
    -                        {cat_features_str},
    -                        {con_features_str},
    +                        $1,                  -- current tree state, madlib.bytea8
    +                        {cat_features_str},  -- categorical features in an array
    +                        {con_features_str},  -- continuous features in an array
                             {dep_var},
    -                        {weights},
    -                        $2,
    -                        $4,
    -                        {dep_n_levels}::smallint,
    -                        {subsample}::boolean
    +                        {weights},           -- weight value
    +                        $2,                  -- categorical sorted levels in a combined array
    +                        $4,                  -- continuous splits
    +                        {dep_n_levels}::smallint, -- number of dependent levels
    +                        {subsample}::boolean  -- should we use a subsample of data
    --- End diff --
    
    The `$3` is part of the `cat_features_str`. I can put in a comment to that effect over here. 


---

[GitHub] madlib issue #277: DT: Add impurity importance metric

Posted by asfgit <gi...@git.apache.org>.

Github user asfgit commented on the issue:

    https://github.com/apache/madlib/pull/277
  
    
    Refer to this link for build results (access rights to CI server needed): 
    https://builds.apache.org/job/madlib-pr-build/501/



---

[GitHub] madlib issue #277: DT: Add impurity importance metric

Posted by asfgit <gi...@git.apache.org>.

Github user asfgit commented on the issue:

    https://github.com/apache/madlib/pull/277
  
    
    Refer to this link for build results (access rights to CI server needed): 
    https://builds.apache.org/job/madlib-pr-build/512/



---

[GitHub] madlib pull request #277: DT: Add impurity importance metric

Posted by asfgit <gi...@git.apache.org>.

Github user asfgit closed the pull request at:

    https://github.com/apache/madlib/pull/277


---

[GitHub] madlib issue #277: DT: Add impurity importance metric

Posted by asfgit <gi...@git.apache.org>.

Github user asfgit commented on the issue:

    https://github.com/apache/madlib/pull/277
  
    
    Refer to this link for build results (access rights to CI server needed): 
    https://builds.apache.org/job/madlib-pr-build/503/



---

[GitHub] madlib issue #277: DT: Add impurity importance metric

Posted by asfgit <gi...@git.apache.org>.

Github user asfgit commented on the issue:

    https://github.com/apache/madlib/pull/277
  
    
    Refer to this link for build results (access rights to CI server needed): 
    https://builds.apache.org/job/madlib-pr-build/505/



---

[GitHub] madlib issue #277: DT: Add impurity importance metric

Posted by asfgit <gi...@git.apache.org>.

Github user asfgit commented on the issue:

    https://github.com/apache/madlib/pull/277
  
    
    Refer to this link for build results (access rights to CI server needed): 
    https://builds.apache.org/job/madlib-pr-build/509/



---

[GitHub] madlib pull request #277: DT: Add impurity importance metric

Posted by njayaram2 <gi...@git.apache.org>.

Github user njayaram2 commented on a diff in the pull request:

    https://github.com/apache/madlib/pull/277#discussion_r196150565
  
    --- Diff: src/ports/postgres/modules/recursive_partitioning/decision_tree.py_in ---
    @@ -1097,28 +1121,21 @@ def _one_step(schema_madlib, training_table_name, cat_features,
                                                              "$3", "$2",
                                                              null_proxy)
     
    -    # The arguments of the aggregate (in the same order):
    -    # 1. current tree state, madlib.bytea8
    -    # 2. categorical features (integer format) in a single array
    -    # 3. continuous features in a single array
    -    # 4. weight value
    -    # 5. categorical sorted levels (integer format) in a combined array
    -    # 6. continuous splits
    -    # 7. number of dependent levels
         train_sql = """
             SELECT (result).* from (
                 SELECT
    -                {schema_madlib}._dt_apply($1,
    +                {schema_madlib}._dt_apply(
    +                    $1,
                         {schema_madlib}._compute_leaf_stats(
    -                        $1,
    -                        {cat_features_str},
    -                        {con_features_str},
    +                        $1,                  -- current tree state, madlib.bytea8
    +                        {cat_features_str},  -- categorical features in an array
    +                        {con_features_str},  -- continuous features in an array
                             {dep_var},
    -                        {weights},
    -                        $2,
    -                        $4,
    -                        {dep_n_levels}::smallint,
    -                        {subsample}::boolean
    +                        {weights},           -- weight value
    +                        $2,                  -- categorical sorted levels in a combined array
    +                        $4,                  -- continuous splits
    +                        {dep_n_levels}::smallint, -- number of dependent levels
    +                        {subsample}::boolean  -- should we use a subsample of data
    --- End diff --
    
    Oh okay, thank you. I think a comment will be useful.


---

[GitHub] madlib issue #277: DT: Add impurity importance metric

Posted by asfgit <gi...@git.apache.org>.

Github user asfgit commented on the issue:

    https://github.com/apache/madlib/pull/277
  
    
    Refer to this link for build results (access rights to CI server needed): 
    https://builds.apache.org/job/madlib-pr-build/508/



---

[GitHub] madlib issue #277: DT: Add impurity importance metric

Posted by asfgit <gi...@git.apache.org>.

Github user asfgit commented on the issue:

    https://github.com/apache/madlib/pull/277
  
    
    Refer to this link for build results (access rights to CI server needed): 
    https://builds.apache.org/job/madlib-pr-build/511/



---

[GitHub] madlib pull request #277: DT: Add impurity importance metric

Posted by njayaram2 <gi...@git.apache.org>.

Github user njayaram2 commented on a diff in the pull request:

    https://github.com/apache/madlib/pull/277#discussion_r194918142
  
    --- Diff: src/ports/postgres/modules/recursive_partitioning/decision_tree.py_in ---
    @@ -1097,28 +1121,21 @@ def _one_step(schema_madlib, training_table_name, cat_features,
                                                              "$3", "$2",
                                                              null_proxy)
     
    -    # The arguments of the aggregate (in the same order):
    -    # 1. current tree state, madlib.bytea8
    -    # 2. categorical features (integer format) in a single array
    -    # 3. continuous features in a single array
    -    # 4. weight value
    -    # 5. categorical sorted levels (integer format) in a combined array
    -    # 6. continuous splits
    -    # 7. number of dependent levels
         train_sql = """
             SELECT (result).* from (
                 SELECT
    -                {schema_madlib}._dt_apply($1,
    +                {schema_madlib}._dt_apply(
    +                    $1,
                         {schema_madlib}._compute_leaf_stats(
    -                        $1,
    -                        {cat_features_str},
    -                        {con_features_str},
    +                        $1,                  -- current tree state, madlib.bytea8
    +                        {cat_features_str},  -- categorical features in an array
    +                        {con_features_str},  -- continuous features in an array
                             {dep_var},
    -                        {weights},
    -                        $2,
    -                        $4,
    -                        {dep_n_levels}::smallint,
    -                        {subsample}::boolean
    +                        {weights},           -- weight value
    +                        $2,                  -- categorical sorted levels in a combined array
    +                        $4,                  -- continuous splits
    +                        {dep_n_levels}::smallint, -- number of dependent levels
    +                        {subsample}::boolean  -- should we use a subsample of data
    --- End diff --
    
    We only use `$1, $2, and $4` in this query. Can we remove the entry associated with `$3` (guess it refers to `bins[cat_origin]`) from prepare and execute statements that follow?


---