Posted to dev@madlib.apache.org by kaknikhil <gi...@git.apache.org> on 2018/04/02 20:58:21 UTC

[GitHub] madlib pull request #254: Enable grouping for minibatch preprocessing

GitHub user kaknikhil opened a pull request:

    https://github.com/apache/madlib/pull/254

    Enable grouping for minibatch preprocessing

    This PR enables grouping for the minibatch preprocessor module.
    
    Other changes:
    1. Added an install check test for special characters.
    2. Improved error messages and created a reusable function for
    testing column dimension in install check.
    3. Added an optional flag to `utils_ind_var_scales_grouping` that
    creates a persistent x_mean table, which the preprocessor module
    reuses as the standardization table.
    4. Added unit tests for `input_tbl_valid` and `output_tbl_valid` in `validate_args.py_in`.
    5. Raised a custom exception for the mocked plpy error.
    
    Co-authored-by: Jingyi Mei <jm...@pivotal.io>
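
For illustration, the per-group standardization described in item 3 can be sketched in plain Python. This is a minimal sketch, not MADlib's implementation: the helper names `group_mean_std` and `standardize`, the row layout, and the use of the population standard deviation are assumptions made here. Only the idea of a per-group mean/std table and of mapping a zero standard deviation to 1.0 (`set_zero_std_to_one`) comes from the thread.

```python
from collections import defaultdict
from statistics import fmean, pstdev


def group_mean_std(rows, set_zero_std_to_one=True):
    """Return {group: (means, stds)} for rows of (group, feature_vector).

    Hypothetical helper: it mimics the idea of a per-group x_mean table,
    not the SQL that MADlib actually runs.
    """
    by_group = defaultdict(list)
    for group, features in rows:
        by_group[group].append(features)

    stats = {}
    for group, vectors in by_group.items():
        columns = list(zip(*vectors))  # one tuple per feature dimension
        means = [fmean(col) for col in columns]
        stds = [pstdev(col) for col in columns]  # population std (assumption)
        if set_zero_std_to_one:
            # A constant column has std 0.0; use 1.0 to avoid dividing by zero.
            stds = [1.0 if s == 0.0 else s for s in stds]
        stats[group] = (means, stds)
    return stats


def standardize(features, means, stds):
    """Scale one feature vector with its group's mean and std."""
    return [(x - m) / s for x, m, s in zip(features, means, stds)]


rows = [("a", [1.0, 5.0]), ("a", [3.0, 5.0]), ("b", [10.0, 0.0])]
stats = group_mean_std(rows)
scaled = standardize([3.0, 5.0], *stats["a"])
```

Keeping the statistics keyed by group is what lets a later step look up the right mean and standard deviation for each row without recomputing them.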

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/madlib/madlib feature/minibatch-preprocessing-grouping

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/madlib/pull/254.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #254
    
----
commit 32eb0d5fd55a502eef654a6309bc15c1fbf548d0
Author: Nikhil Kak <nk...@...>
Date:   2018-03-23T18:29:07Z

    MiniBatch Pre-Processor: Add support for grouping
    
    This commit enables grouping for the minibatch preprocessor module.
    
    Other changes:
    1. Added an install check test for special characters.
    2. Improved error messages and created a reusable function for
    testing column dimension in install check.
    3. Added a new optional flag to utils_ind_var_scales_grouping that
    creates a persistent x_mean table, which the preprocessor module
    reuses as the standardization table.
    
    Co-authored-by: Jingyi Mei <jm...@pivotal.io>

commit e5be55d5ce5e23f04955f3b69ae23175b5d0d500
Author: Nikhil Kak <nk...@...>
Date:   2018-03-30T03:09:53Z

    Add unit test file for validate args
    
    This commit adds a new unit test file for the validate_args Python file.
    The only two functions tested right now are input_tbl_valid and
    output_tbl_valid.

commit a4d8b69624a19d7a184a7878b1f043cf87618c4d
Author: Nikhil Kak <nk...@...>
Date:   2018-03-30T18:29:22Z

    UnitTests: Raise custom exception for mocked plpy error.
    
    Before this commit, every unit test that needed to assert that
    plpy.error was called had to assert that a generic Exception was
    raised. That check was too broad: it could not distinguish an exception
    raised by the plpy mock class from any other exception.
    With this commit, the mock raises a custom plpy exception, so tests no
    longer need to compare error messages. Asserting that this exception
    was raised is sufficient proof that plpy.error was called.

----
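
The testing pattern described in the last commit can be sketched as follows. This is only an illustration under stated assumptions: the names `PLPYException` and `MockPlpy`, and the toy `input_tbl_valid` body, are hypothetical and are not taken from the MADlib sources. The point is that a dedicated exception type lets `assertRaises` confirm that the mocked `plpy.error` was the source of the failure.

```python
import unittest


class PLPYException(Exception):
    """Hypothetical exception raised only by the plpy mock."""


class MockPlpy:
    """Minimal stand-in for the plpy module available inside the database."""

    def error(self, message):
        raise PLPYException(message)


plpy = MockPlpy()


def input_tbl_valid(tbl):
    """Toy validator (an assumption, not MADlib's real function)."""
    if not tbl:
        plpy.error("Input table name is NULL or empty")
    return True


class ValidateArgsTest(unittest.TestCase):
    def test_empty_table_name_calls_plpy_error(self):
        # Only the mock raises PLPYException, so no message comparison is needed.
        with self.assertRaises(PLPYException):
            input_tbl_valid("")

    def test_valid_table_name_passes(self):
        self.assertTrue(input_tbl_valid("my_table"))


def run_tests():
    """Run the test case and return the unittest result object."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(ValidateArgsTest)
    return unittest.TextTestRunner(verbosity=0).run(suite)
```

A test that asserted on a bare `Exception` would also pass if the validator failed for an unrelated reason, such as a `TypeError`; the custom type removes that ambiguity.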


---

[GitHub] madlib issue #254: Enable grouping for minibatch preprocessing

Posted by asfgit <gi...@git.apache.org>.
GitHub user asfgit commented on the issue:

    https://github.com/apache/madlib/pull/254
  
    
    Refer to this link for build results (access rights to CI server needed): 
    https://builds.apache.org/job/madlib-pr-build/421/



---

[GitHub] madlib issue #254: Enable grouping for minibatch preprocessing

Posted by asfgit <gi...@git.apache.org>.
GitHub user asfgit commented on the issue:

    https://github.com/apache/madlib/pull/254
  
    
    Refer to this link for build results (access rights to CI server needed): 
    https://builds.apache.org/job/madlib-pr-build/418/



---

[GitHub] madlib pull request #254: Enable grouping for minibatch preprocessing

Posted by njayaram2 <gi...@git.apache.org>.
GitHub user njayaram2 commented on a diff in the pull request:

    https://github.com/apache/madlib/pull/254#discussion_r178684532
  
    --- Diff: src/ports/postgres/modules/utilities/mean_std_dev_calculator.py_in ---
    @@ -40,15 +41,27 @@ class MeanStdDevCalculator:
             self.dimension = dimension
     
         def get_mean_and_std_dev_for_ind_var(self):
    -        set_zero_std_to_one = True
    -
             x_scaled_vals = utils_ind_var_scales(self.source_table,
                                                  self.indep_var_array_str,
                                                  self.dimension,
                                                  self.schema_madlib,
    -                                             None, # do not dump the output to a temp table
    -                                             set_zero_std_to_one)
    +                                             x_mean_table = None, # do not dump the output to a temp table
    +                                             set_zero_std_to_one=True)
             x_mean_str = _array_to_string(x_scaled_vals["mean"])
             x_std_str = _array_to_string(x_scaled_vals["std"])
     
    +        if not x_mean_str or not x_std_str:
    +            plpy.error("mean/stddev for the independent variable"
    +                       "cannot be null")
    +
             return x_mean_str, x_std_str
    +
    +    def create_mean_std_table_for_ind_var_grouping(self, x_mean_table, grouping_cols):
    +        utils_ind_var_scales_grouping(self.source_table,
    +                                             self.indep_var_array_str,
    +                                             self.dimension,
    +                                             self.schema_madlib,
    +                                             grouping_cols,
    +                                             x_mean_table,
    +                                             set_zero_std_to_one = True,
    +                                             create_temp_table = False)
    --- End diff --
    
    Could you please correct the indentation here?
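
As a hedged illustration of the requested change, the call can be written with every continuation line aligned to the first argument after the opening parenthesis. The stub below is an assumption that only mirrors the argument order visible in the diff; its parameter names after `dimension` and its default values are guesses, not the real signature. The last lines also show a separate detail of the diff: Python joins adjacent string literals without a separator, so the error message as written reads "variablecannot".

```python
def utils_ind_var_scales_grouping(tbl_data, col_ind_var, dimension,
                                  schema_madlib, grouping_cols, x_mean_table,
                                  set_zero_std_to_one=False,
                                  create_temp_table=True):
    """Stub that echoes its inputs (defaults here are assumptions)."""
    return {"x_mean_table": x_mean_table,
            "grouping_cols": grouping_cols,
            "set_zero_std_to_one": set_zero_std_to_one,
            "create_temp_table": create_temp_table}


# Continuation lines aligned with the first argument of the call.
result = utils_ind_var_scales_grouping("source_table",
                                       "x",
                                       3,
                                       "madlib",
                                       "grp",
                                       "x_mean_out",
                                       set_zero_std_to_one=True,
                                       create_temp_table=False)

# Adjacent string literals are concatenated with no space between them.
message_as_in_diff = ("mean/stddev for the independent variable"
                      "cannot be null")
message_with_space = ("mean/stddev for the independent variable "
                      "cannot be null")
```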


---

[GitHub] madlib pull request #254: Enable grouping for minibatch preprocessing

Posted by asfgit <gi...@git.apache.org>.
GitHub user asfgit closed the pull request at:

    https://github.com/apache/madlib/pull/254


---

[GitHub] madlib pull request #254: Enable grouping for minibatch preprocessing

Posted by njayaram2 <gi...@git.apache.org>.
GitHub user njayaram2 commented on a diff in the pull request:

    https://github.com/apache/madlib/pull/254#discussion_r178684358
  
    --- Diff: src/ports/postgres/modules/convex/utils_regularization.py_in ---
    @@ -85,6 +86,8 @@ def utils_ind_var_scales_grouping(tbl_data, col_ind_var, dimension,
             x_mean_table,
             set_zero_std_to_one (optional, default is False. If set to true
                          0.0 standard deviation values will be set to 1.0)
    +        create_temp_table If set to true, create a persistent instead of a temp
    +                          table, else create a temp table for x_mean
    --- End diff --
    
    Shouldn't this comment say that a temp table is created when the flag is set to true, and a persistent table when it is set to false?
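
The semantics the reviewer expects can be sketched with a small helper. This is a hypothetical function written for illustration, not code from `utils_regularization.py_in`; it assumes the flag simply toggles the `TEMP` keyword in a `CREATE TABLE ... AS` statement.

```python
def create_table_statement(table_name, select_sql, create_temp_table=True):
    """Build a CREATE TABLE ... AS statement (hypothetical helper).

    create_temp_table=True  -> temp table, dropped at the end of the session
    create_temp_table=False -> persistent table
    """
    temp_keyword = "TEMP " if create_temp_table else ""
    return "CREATE {0}TABLE {1} AS {2}".format(temp_keyword, table_name, select_sql)


temp_sql = create_table_statement("x_mean", "SELECT 1 AS mean")
persistent_sql = create_table_statement("x_mean", "SELECT 1 AS mean",
                                        create_temp_table=False)
```

Under this reading, the docstring and the flag name agree: passing `create_temp_table=False`, as the preprocessor does in the other diff, yields the persistent table.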


---