You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@madlib.apache.org by iyerr3 <gi...@git.apache.org> on 2018/03/22 23:52:19 UTC
[GitHub] madlib pull request #248: DT: Ensure proper quoting in grouping coalesce
GitHub user iyerr3 opened a pull request:
https://github.com/apache/madlib/pull/248
DT: Ensure proper quoting in grouping coalesce
JIRA: MADLIB-1217
Grouping column value is coalesced with null_proxy to get the right null
identifier when null_as_category is True. The null_proxy needs to be
quoted appropriately as a string.
Closes #248
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/iyerr3/incubator-madlib bugfix/null_category_grouping
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/madlib/pull/248.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #248
----
commit ffd6355334bedaf4a6d502685330500abd3735d7
Author: Rahul Iyer <ri...@...>
Date: 2018-03-22T23:48:53Z
DT: Ensure proper quoting in grouping coalesce
JIRA: MADLIB-1217
Grouping column value is coalesced with null_proxy to get the right null
identifier when null_as_category is True. The null_proxy needs to be
quoted appropriately as a string.
Closes #248
----
---
[GitHub] madlib issue #248: DT: Ensure proper quoting in grouping coalesce
Posted by asfgit <gi...@git.apache.org>.
Github user asfgit commented on the issue:
https://github.com/apache/madlib/pull/248
Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/madlib-pr-build/407/
---
[GitHub] madlib issue #248: DT: Ensure proper quoting in grouping coalesce
Posted by fmcquillan99 <gi...@git.apache.org>.
Github user fmcquillan99 commented on the issue:
https://github.com/apache/madlib/pull/248
I checked against the examples in
JIRA: MADLIB-1217
JIRA: MADLIB-1218
and both work OK for me.
So from the fix to the functionality perspective, LGTM.
Other reviewers can comment on the software implementation, thx.
---
[GitHub] madlib pull request #248: DT: Ensure proper quoting in grouping coalesce
Posted by iyerr3 <gi...@git.apache.org>.
Github user iyerr3 commented on a diff in the pull request:
https://github.com/apache/madlib/pull/248#discussion_r177503935
--- Diff: src/ports/postgres/modules/recursive_partitioning/decision_tree.py_in ---
@@ -970,16 +970,35 @@ def _get_bins_grps(
"one value are dropped from the tree model.")
cat_features = [feature for feature in cat_features if feature in use_cat_features]
- grp_to_col_to_row = dict((grp_key, dict(
- (row['colname'], row['levels']) for row in items))
- for grp_key, items in groupby(all_levels, key=itemgetter('grp_key')))
-
+ # grp_col_to_levels is a list of tuples (pairs) with
+ # first value = group value,
+ # second value = a dict mapping a categorical column to its levels in data
+ # (these levels are specific to the group and can be different
+ # for different groups)
+ # The list of tuples can be converted to a dict, but the ordering
+ # will be lost.
+ # eg. grp_col_to_levels =
+ # [
+ # ('3', {'vs': [0, 1], 'cyl': [4,6,8]}),
+ # ('4', {'vs': [0, 1], 'cyl': [4,6]}),
+ # ('5', {'vs': [0, 1], 'cyl': [4,6,8]})
+ # ]
+ grp_to_col_to_levels = [
+ (grp_key, dict((row['colname'], row['levels']) for row in items))
+ for grp_key, items in groupby(all_levels, key=itemgetter('grp_key'))]
if cat_features:
- cat_items_list = [rows[col] for col in cat_features
- for grp_key, rows in grp_to_col_to_row.items() if col in rows]
+ # Below statements collect the grp_to_col_to_levels into multiple variables
+ # From above eg.
+ # cat_items_list = [[0,1], [4,6,8], [0,1], [4,6], [0,1], [4,6,8]]
+ # cat_n = [2, 3, 2, 2, 2, 3]
+ # cat_n = [0, 1, 4, 6, 8, 0, 1, 4, 6, 0, 1, 4, 6, 8]
--- End diff --
Yes, thanks for the catch. I'll update before merging.
---
[GitHub] madlib pull request #248: DT: Ensure proper quoting in grouping coalesce
Posted by njayaram2 <gi...@git.apache.org>.
Github user njayaram2 commented on a diff in the pull request:
https://github.com/apache/madlib/pull/248#discussion_r177495864
--- Diff: src/ports/postgres/modules/recursive_partitioning/decision_tree.py_in ---
@@ -970,16 +970,35 @@ def _get_bins_grps(
"one value are dropped from the tree model.")
cat_features = [feature for feature in cat_features if feature in use_cat_features]
- grp_to_col_to_row = dict((grp_key, dict(
- (row['colname'], row['levels']) for row in items))
- for grp_key, items in groupby(all_levels, key=itemgetter('grp_key')))
-
+ # grp_col_to_levels is a list of tuples (pairs) with
+ # first value = group value,
+ # second value = a dict mapping a categorical column to its levels in data
+ # (these levels are specific to the group and can be different
+ # for different groups)
+ # The list of tuples can be converted to a dict, but the ordering
+ # will be lost.
+ # eg. grp_col_to_levels =
+ # [
+ # ('3', {'vs': [0, 1], 'cyl': [4,6,8]}),
+ # ('4', {'vs': [0, 1], 'cyl': [4,6]}),
+ # ('5', {'vs': [0, 1], 'cyl': [4,6,8]})
+ # ]
+ grp_to_col_to_levels = [
+ (grp_key, dict((row['colname'], row['levels']) for row in items))
+ for grp_key, items in groupby(all_levels, key=itemgetter('grp_key'))]
if cat_features:
- cat_items_list = [rows[col] for col in cat_features
- for grp_key, rows in grp_to_col_to_row.items() if col in rows]
+ # Below statements collect the grp_to_col_to_levels into multiple variables
+ # From above eg.
+ # cat_items_list = [[0,1], [4,6,8], [0,1], [4,6], [0,1], [4,6,8]]
+ # cat_n = [2, 3, 2, 2, 2, 3]
+ # cat_n = [0, 1, 4, 6, 8, 0, 1, 4, 6, 0, 1, 4, 6, 8]
+ # grp_key_cat = ['3', '4', '5']
--- End diff --
+1 for the examples, very helpful.
---
[GitHub] madlib pull request #248: DT: Ensure proper quoting in grouping coalesce
Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:
https://github.com/apache/madlib/pull/248
---
[GitHub] madlib issue #248: DT: Ensure proper quoting in grouping coalesce
Posted by asfgit <gi...@git.apache.org>.
Github user asfgit commented on the issue:
https://github.com/apache/madlib/pull/248
Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/madlib-pr-build/400/
---
[GitHub] madlib issue #248: DT: Ensure proper quoting in grouping coalesce
Posted by iyerr3 <gi...@git.apache.org>.
Github user iyerr3 commented on the issue:
https://github.com/apache/madlib/pull/248
Commits 97064e2 and 2ad0bf7 will eventually be combined in ffd6355 before merging.
---
[GitHub] madlib issue #248: DT: Ensure proper quoting in grouping coalesce
Posted by asfgit <gi...@git.apache.org>.
Github user asfgit commented on the issue:
https://github.com/apache/madlib/pull/248
Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/madlib-pr-build/401/
---
[GitHub] madlib pull request #248: DT: Ensure proper quoting in grouping coalesce
Posted by njayaram2 <gi...@git.apache.org>.
Github user njayaram2 commented on a diff in the pull request:
https://github.com/apache/madlib/pull/248#discussion_r177496541
--- Diff: src/ports/postgres/modules/recursive_partitioning/decision_tree.py_in ---
@@ -970,16 +970,35 @@ def _get_bins_grps(
"one value are dropped from the tree model.")
cat_features = [feature for feature in cat_features if feature in use_cat_features]
- grp_to_col_to_row = dict((grp_key, dict(
- (row['colname'], row['levels']) for row in items))
- for grp_key, items in groupby(all_levels, key=itemgetter('grp_key')))
-
+ # grp_col_to_levels is a list of tuples (pairs) with
+ # first value = group value,
+ # second value = a dict mapping a categorical column to its levels in data
+ # (these levels are specific to the group and can be different
+ # for different groups)
+ # The list of tuples can be converted to a dict, but the ordering
+ # will be lost.
+ # eg. grp_col_to_levels =
+ # [
+ # ('3', {'vs': [0, 1], 'cyl': [4,6,8]}),
+ # ('4', {'vs': [0, 1], 'cyl': [4,6]}),
+ # ('5', {'vs': [0, 1], 'cyl': [4,6,8]})
+ # ]
+ grp_to_col_to_levels = [
+ (grp_key, dict((row['colname'], row['levels']) for row in items))
+ for grp_key, items in groupby(all_levels, key=itemgetter('grp_key'))]
if cat_features:
- cat_items_list = [rows[col] for col in cat_features
- for grp_key, rows in grp_to_col_to_row.items() if col in rows]
+ # Below statements collect the grp_to_col_to_levels into multiple variables
+ # From above eg.
+ # cat_items_list = [[0,1], [4,6,8], [0,1], [4,6], [0,1], [4,6,8]]
+ # cat_n = [2, 3, 2, 2, 2, 3]
+ # cat_n = [0, 1, 4, 6, 8, 0, 1, 4, 6, 0, 1, 4, 6, 8]
--- End diff --
Should this be `cat_origin` instead?
---
[GitHub] madlib issue #248: DT: Ensure proper quoting in grouping coalesce
Posted by asfgit <gi...@git.apache.org>.
Github user asfgit commented on the issue:
https://github.com/apache/madlib/pull/248
Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/madlib-pr-build/398/
---
[GitHub] madlib issue #248: DT: Ensure proper quoting in grouping coalesce
Posted by asfgit <gi...@git.apache.org>.
Github user asfgit commented on the issue:
https://github.com/apache/madlib/pull/248
Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/madlib-pr-build/399/
---
[GitHub] madlib issue #248: DT: Ensure proper quoting in grouping coalesce
Posted by iyerr3 <gi...@git.apache.org>.
Github user iyerr3 commented on the issue:
https://github.com/apache/madlib/pull/248
I've combined two independent commits into a single PR, since the changes in the commits are close by and would benefit from the same reviewer looking at both changes.
---