You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@madlib.apache.org by iyerr3 <gi...@git.apache.org> on 2018/03/22 23:52:19 UTC

[GitHub] madlib pull request #248: DT: Ensure proper quoting in grouping coalesce

GitHub user iyerr3 opened a pull request:

    https://github.com/apache/madlib/pull/248

    DT: Ensure proper quoting in grouping coalesce

    JIRA: MADLIB-1217
    
    Grouping column value is coalesced with null_proxy to get the right null
    identifier when null_as_category is True. The null_proxy needs to be
    quoted appropriately as a string.
    
    Closes #248

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/iyerr3/incubator-madlib bugfix/null_category_grouping

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/madlib/pull/248.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #248
    
----
commit ffd6355334bedaf4a6d502685330500abd3735d7
Author: Rahul Iyer <ri...@...>
Date:   2018-03-22T23:48:53Z

    DT: Ensure proper quoting in grouping coalesce
    
    JIRA: MADLIB-1217
    
    Grouping column value is coalesced with null_proxy to get the right null
    identifier when null_as_category is True. The null_proxy needs to be
    quoted appropriately as a string.
    
    Closes #248

----


---

[GitHub] madlib issue #248: DT: Ensure proper quoting in grouping coalesce

Posted by asfgit <gi...@git.apache.org>.

Github user asfgit commented on the issue:

    https://github.com/apache/madlib/pull/248
  
    
    Refer to this link for build results (access rights to CI server needed): 
    https://builds.apache.org/job/madlib-pr-build/407/



---

[GitHub] madlib issue #248: DT: Ensure proper quoting in grouping coalesce

Posted by fmcquillan99 <gi...@git.apache.org>.

Github user fmcquillan99 commented on the issue:

    https://github.com/apache/madlib/pull/248
  
    I checked against the examples in 
    JIRA: MADLIB-1217
    JIRA: MADLIB-1218
    and both work OK for me.  
    
    So from the fix to the functionality perspective, LGTM.  
    
    Other reviewers can comment on the software implementation, thx.
    
    
    



---

[GitHub] madlib pull request #248: DT: Ensure proper quoting in grouping coalesce

Posted by iyerr3 <gi...@git.apache.org>.

Github user iyerr3 commented on a diff in the pull request:

    https://github.com/apache/madlib/pull/248#discussion_r177503935
  
    --- Diff: src/ports/postgres/modules/recursive_partitioning/decision_tree.py_in ---
    @@ -970,16 +970,35 @@ def _get_bins_grps(
                              "one value are dropped from the tree model.")
                 cat_features = [feature for feature in cat_features if feature in use_cat_features]
     
    -        grp_to_col_to_row = dict((grp_key, dict(
    -            (row['colname'], row['levels']) for row in items))
    -            for grp_key, items in groupby(all_levels, key=itemgetter('grp_key')))
    -
    +        # grp_col_to_levels is a list of tuples (pairs) with
    +        #   first value = group value,
    +        #   second value = a dict mapping a categorical column to its levels in data
    +        #           (these levels are specific to the group and can be different
    +        #               for different groups)
    +        #   The list of tuples can be converted to a dict, but the ordering
    +        #   will be lost.
    +        # eg. grp_col_to_levels =
    +        #   [
    +        #       ('3', {'vs': [0, 1], 'cyl': [4,6,8]}),
    +        #       ('4', {'vs': [0, 1], 'cyl': [4,6]}),
    +        #       ('5', {'vs': [0, 1], 'cyl': [4,6,8]})
    +        #   ]
    +        grp_to_col_to_levels = [
    +            (grp_key, dict((row['colname'], row['levels']) for row in items))
    +            for grp_key, items in groupby(all_levels, key=itemgetter('grp_key'))]
         if cat_features:
    -        cat_items_list = [rows[col] for col in cat_features
    -                          for grp_key, rows in grp_to_col_to_row.items() if col in rows]
    +        # Below statements collect the grp_to_col_to_levels into multiple variables
    +        # From above eg.
    +        #   cat_items_list = [[0,1], [4,6,8], [0,1], [4,6], [0,1], [4,6,8]]
    +        #   cat_n = [2, 3, 2, 2, 2, 3]
    +        #   cat_n = [0, 1, 4, 6, 8, 0, 1, 4, 6, 0, 1, 4, 6, 8]
    --- End diff --
    
    Yes, thanks for the catch. I'll update before merging. 


---

[GitHub] madlib pull request #248: DT: Ensure proper quoting in grouping coalesce

Posted by njayaram2 <gi...@git.apache.org>.

Github user njayaram2 commented on a diff in the pull request:

    https://github.com/apache/madlib/pull/248#discussion_r177495864
  
    --- Diff: src/ports/postgres/modules/recursive_partitioning/decision_tree.py_in ---
    @@ -970,16 +970,35 @@ def _get_bins_grps(
                              "one value are dropped from the tree model.")
                 cat_features = [feature for feature in cat_features if feature in use_cat_features]
     
    -        grp_to_col_to_row = dict((grp_key, dict(
    -            (row['colname'], row['levels']) for row in items))
    -            for grp_key, items in groupby(all_levels, key=itemgetter('grp_key')))
    -
    +        # grp_col_to_levels is a list of tuples (pairs) with
    +        #   first value = group value,
    +        #   second value = a dict mapping a categorical column to its levels in data
    +        #           (these levels are specific to the group and can be different
    +        #               for different groups)
    +        #   The list of tuples can be converted to a dict, but the ordering
    +        #   will be lost.
    +        # eg. grp_col_to_levels =
    +        #   [
    +        #       ('3', {'vs': [0, 1], 'cyl': [4,6,8]}),
    +        #       ('4', {'vs': [0, 1], 'cyl': [4,6]}),
    +        #       ('5', {'vs': [0, 1], 'cyl': [4,6,8]})
    +        #   ]
    +        grp_to_col_to_levels = [
    +            (grp_key, dict((row['colname'], row['levels']) for row in items))
    +            for grp_key, items in groupby(all_levels, key=itemgetter('grp_key'))]
         if cat_features:
    -        cat_items_list = [rows[col] for col in cat_features
    -                          for grp_key, rows in grp_to_col_to_row.items() if col in rows]
    +        # Below statements collect the grp_to_col_to_levels into multiple variables
    +        # From above eg.
    +        #   cat_items_list = [[0,1], [4,6,8], [0,1], [4,6], [0,1], [4,6,8]]
    +        #   cat_n = [2, 3, 2, 2, 2, 3]
    +        #   cat_n = [0, 1, 4, 6, 8, 0, 1, 4, 6, 0, 1, 4, 6, 8]
    +        #   grp_key_cat = ['3', '4', '5']
    --- End diff --
    
    +1 for the examples, very helpful.


---

[GitHub] madlib pull request #248: DT: Ensure proper quoting in grouping coalesce

Posted by asfgit <gi...@git.apache.org>.

Github user asfgit closed the pull request at:

    https://github.com/apache/madlib/pull/248


---

[GitHub] madlib issue #248: DT: Ensure proper quoting in grouping coalesce

Posted by asfgit <gi...@git.apache.org>.

Github user asfgit commented on the issue:

    https://github.com/apache/madlib/pull/248
  
    
    Refer to this link for build results (access rights to CI server needed): 
    https://builds.apache.org/job/madlib-pr-build/400/



---

[GitHub] madlib issue #248: DT: Ensure proper quoting in grouping coalesce

Posted by iyerr3 <gi...@git.apache.org>.

Github user iyerr3 commented on the issue:

    https://github.com/apache/madlib/pull/248
  
    Commits 97064e2 and 2ad0bf7 will eventually be combined in ffd6355 before merging. 


---

[GitHub] madlib issue #248: DT: Ensure proper quoting in grouping coalesce

Posted by asfgit <gi...@git.apache.org>.

Github user asfgit commented on the issue:

    https://github.com/apache/madlib/pull/248
  
    
    Refer to this link for build results (access rights to CI server needed): 
    https://builds.apache.org/job/madlib-pr-build/401/



---

[GitHub] madlib pull request #248: DT: Ensure proper quoting in grouping coalesce

Posted by njayaram2 <gi...@git.apache.org>.

Github user njayaram2 commented on a diff in the pull request:

    https://github.com/apache/madlib/pull/248#discussion_r177496541
  
    --- Diff: src/ports/postgres/modules/recursive_partitioning/decision_tree.py_in ---
    @@ -970,16 +970,35 @@ def _get_bins_grps(
                              "one value are dropped from the tree model.")
                 cat_features = [feature for feature in cat_features if feature in use_cat_features]
     
    -        grp_to_col_to_row = dict((grp_key, dict(
    -            (row['colname'], row['levels']) for row in items))
    -            for grp_key, items in groupby(all_levels, key=itemgetter('grp_key')))
    -
    +        # grp_col_to_levels is a list of tuples (pairs) with
    +        #   first value = group value,
    +        #   second value = a dict mapping a categorical column to its levels in data
    +        #           (these levels are specific to the group and can be different
    +        #               for different groups)
    +        #   The list of tuples can be converted to a dict, but the ordering
    +        #   will be lost.
    +        # eg. grp_col_to_levels =
    +        #   [
    +        #       ('3', {'vs': [0, 1], 'cyl': [4,6,8]}),
    +        #       ('4', {'vs': [0, 1], 'cyl': [4,6]}),
    +        #       ('5', {'vs': [0, 1], 'cyl': [4,6,8]})
    +        #   ]
    +        grp_to_col_to_levels = [
    +            (grp_key, dict((row['colname'], row['levels']) for row in items))
    +            for grp_key, items in groupby(all_levels, key=itemgetter('grp_key'))]
         if cat_features:
    -        cat_items_list = [rows[col] for col in cat_features
    -                          for grp_key, rows in grp_to_col_to_row.items() if col in rows]
    +        # Below statements collect the grp_to_col_to_levels into multiple variables
    +        # From above eg.
    +        #   cat_items_list = [[0,1], [4,6,8], [0,1], [4,6], [0,1], [4,6,8]]
    +        #   cat_n = [2, 3, 2, 2, 2, 3]
    +        #   cat_n = [0, 1, 4, 6, 8, 0, 1, 4, 6, 0, 1, 4, 6, 8]
    --- End diff --
    
    Should this be `cat_origin` instead?


---

[GitHub] madlib issue #248: DT: Ensure proper quoting in grouping coalesce

Posted by asfgit <gi...@git.apache.org>.

Github user asfgit commented on the issue:

    https://github.com/apache/madlib/pull/248
  
    
    Refer to this link for build results (access rights to CI server needed): 
    https://builds.apache.org/job/madlib-pr-build/398/



---

[GitHub] madlib issue #248: DT: Ensure proper quoting in grouping coalesce

Posted by asfgit <gi...@git.apache.org>.

Github user asfgit commented on the issue:

    https://github.com/apache/madlib/pull/248
  
    
    Refer to this link for build results (access rights to CI server needed): 
    https://builds.apache.org/job/madlib-pr-build/399/



---

[GitHub] madlib issue #248: DT: Ensure proper quoting in grouping coalesce

Posted by iyerr3 <gi...@git.apache.org>.

Github user iyerr3 commented on the issue:

    https://github.com/apache/madlib/pull/248
  
    I've combined two independent commits into a single PR, since the changes in the commits are close by and would benefit from the same reviewer looking at both changes. 


---