Posted to dev@madlib.apache.org by njayaram2 <gi...@git.apache.org> on 2018/12/20 01:05:21 UTC

[GitHub] madlib pull request #342: Minibatch Preprocessor for Deep learning

GitHub user njayaram2 opened a pull request:

    https://github.com/apache/madlib/pull/342

    Minibatch Preprocessor for Deep learning

    The minibatch preprocessor we currently have in MADlib is bloated for DL
    tasks. This feature adds a simplified way of creating buffers, and divides
    each element of the independent array by a normalizing constant (255.0 by
    default) for standardization, which is standard practice with image data.
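
    A minimal usage sketch (table, column, and literal values below are
    illustrative, and `madlib` is assumed to be the install schema; the full
    signature is described in the help messages later in this thread):

        SELECT madlib.minibatch_preprocessor_dl(
            'image_data',          -- source table, one image per row
            'image_data_packed',   -- output table of packed buffers
            'label',               -- dependent variable column
            'pixels',              -- independent variable column (0-255 values)
            256,                   -- buffer_size (optional; computed if omitted)
            255.0                  -- normalizing_const (optional; default 255.0)
        );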
    
    Co-authored-by: Arvind Sridhar <as...@pivotal.io>
    Co-authored-by: Domino Valdano <dv...@pivotal.io>

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/madlib/madlib deep-learning/minibatch-preprocessor

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/madlib/pull/342.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #342
    
----
commit c983aafcd5e31bab5dbc278178ff9e2e17942ea1
Author: Nandish Jayaram <nj...@...>
Date:   2018-12-18T01:54:42Z

    Minibatch Preprocessor for Deep learning
    
    The minibatch preprocessor we currently have in MADlib is bloated for DL
    tasks. This feature adds a simplified way of creating buffers, and divides
    each element of the independent array by a normalizing constant (255.0 by
    default) for standardization, which is standard practice with image data.
    
    Co-authored-by: Arvind Sridhar <as...@pivotal.io>
    Co-authored-by: Domino Valdano <dv...@pivotal.io>

----


---

[GitHub] madlib pull request #342: Minibatch Preprocessor for Deep learning

Posted by reductionista <gi...@git.apache.org>.
Github user reductionista commented on a diff in the pull request:

    https://github.com/apache/madlib/pull/342#discussion_r243722566
  
    --- Diff: src/ports/postgres/modules/utilities/minibatch_preprocessing.py_in ---
    @@ -580,3 +679,82 @@ class MiniBatchDocumentation:
                 for help.
             """.format(**locals())
     # ---------------------------------------------------------------------
    +    @staticmethod
    +    def minibatch_preprocessor_dl_help(schema_madlib, message):
    +        method = "minibatch_preprocessor_dl"
    +        summary = """
    +        ----------------------------------------------------------------
    +                            SUMMARY
    +        ----------------------------------------------------------------
    +        For Deep Learning based techniques such as Convolutional Neural Nets,
    +        the input data is mostly images. These images can be represented as an
    +        array of numbers where all elements are between 0 and 255 in value.
    +        It is standard practice to divide each of these numbers by 255.0 to
    +        normalize the image data. minibatch_preprocessor() covers general
    +        use-cases; for deep learning use-cases we provide
    +        minibatch_preprocessor_dl(), a light-weight variant that is
    +        specific to image datasets.
    +
    +        The normalizing constant is parameterized, and can be specified based
    +        on the kind of image data used.
    +
    +        For more details on function usage:
    +        SELECT {schema_madlib}.{method}('usage')
    +        """.format(**locals())
    +
    +        usage = """
    +        ---------------------------------------------------------------------------
    +                                        USAGE
    +        ---------------------------------------------------------------------------
    +        SELECT {schema_madlib}.{method}(
    +            source_table,          -- TEXT. Name of the table containing input
    +                                      data.  Can also be a view
    +            output_table,          -- TEXT. Name of the output table for
    +                                      mini-batching
    +            dependent_varname,     -- TEXT. Name of the dependent variable column
    +            independent_varname,   -- TEXT. Name of the independent variable
    +                                      column
    +            buffer_size            -- INTEGER. Default computed automatically.
    +                                      Number of source input rows to pack into a buffer
    +            normalizing_const      -- DOUBLE PRECISION. Default 255.0. The
    +                                      normalizing constant to use for
    +                                      standardizing arrays in independent_varname.
    +        );
    +
    +
    +        ---------------------------------------------------------------------------
    +                                        OUTPUT
    +        ---------------------------------------------------------------------------
    +        The output table produced by MiniBatch Preprocessor contains the
    +        following columns:
    +
    +        buffer_id               -- INTEGER.  Unique id for packed table.
    +        dependent_varname       -- FLOAT8[]. Packed array of dependent variables.
    +        independent_varname     -- FLOAT8[]. Packed array of independent
    +                                   variables.
    +
    --- End diff --
    
    Assuming my previous suggestion is taken, I would write {dependent_varname} and {independent_varname} here to distinguish from the columns in the summary table, which are literal strings rather than references to the parameters the user passes in.
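
    To make the suggestion concrete, the OUTPUT section would then read as
    follows (a sketch only; note that since format(**locals()) is applied to
    the help string, literal braces would need to be escaped as {{...}} in the
    Python source):

        buffer_id               -- INTEGER.  Unique id for packed table.
        {dependent_varname}     -- FLOAT8[]. Packed array of dependent variables.
        {independent_varname}   -- FLOAT8[]. Packed array of independent variables.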


---

[GitHub] madlib pull request #342: Minibatch Preprocessor for Deep learning

Posted by reductionista <gi...@git.apache.org>.
Github user reductionista commented on a diff in the pull request:

    https://github.com/apache/madlib/pull/342#discussion_r243694232
  
    --- Diff: src/ports/postgres/modules/utilities/minibatch_preprocessing.py_in ---
    @@ -51,6 +51,105 @@ m4_changequote(`<!', `!>')
     MINIBATCH_OUTPUT_DEPENDENT_COLNAME = "dependent_varname"
     MINIBATCH_OUTPUT_INDEPENDENT_COLNAME = "independent_varname"
     
    +class MiniBatchPreProcessorDL:
    +    def __init__(self, schema_madlib, source_table, output_table,
    +                 dependent_varname, independent_varname, buffer_size,
    +                 normalizing_const, **kwargs):
    +        self.schema_madlib = schema_madlib
    +        self.source_table = source_table
    +        self.output_table = output_table
    +        self.dependent_varname = dependent_varname
    +        self.independent_varname = independent_varname
    +        self.buffer_size = buffer_size
    +        self.normalizing_const = normalizing_const
    +        self.module_name = "minibatch_preprocessor_DL"
    +        self.output_summary_table = add_postfix(self.output_table, "_summary")
    +        self._validate_args()
    +        self.num_of_buffers = self._get_num_buffers()
    +
    +    def minibatch_preprocessor_dl(self):
    +        norm_tbl = unique_string(desp='normalized')
    +        # Create a temp table that has independent var normalized.
    +        scalar_mult_sql = """
    +            CREATE TEMP TABLE {norm_tbl} AS
    +            SELECT {self.schema_madlib}.array_scalar_mult(
    +                {self.independent_varname}::REAL[], (1/{self.normalizing_const})::REAL) AS x_norm,
    +                {self.dependent_varname} AS y,
    +                row_number() over() AS row_id
    +            FROM {self.source_table}
    +        """.format(**locals())
    +        plpy.execute(scalar_mult_sql)
    +        # Create the mini-batched output table
    +        if is_platform_pg():
    +            distributed_by_clause = ''
    +        else:
    +            distributed_by_clause = ' DISTRIBUTED BY (buffer_id) '
    +        sql = """
    +            CREATE TABLE {self.output_table} AS
    +            SELECT * FROM
    +            (
    +                SELECT {self.schema_madlib}.agg_array_concat(
    +                    ARRAY[{norm_tbl}.x_norm::REAL[]]) AS {x},
    +                    array_agg({norm_tbl}.y) AS {y},
    +                    ({norm_tbl}.row_id%{self.num_of_buffers})::smallint AS buffer_id
    +                FROM {norm_tbl}
    +                GROUP BY buffer_id
    +            ) b
    +            {distributed_by_clause}
    +        """.format(x=MINIBATCH_OUTPUT_INDEPENDENT_COLNAME,
    +                   y=MINIBATCH_OUTPUT_DEPENDENT_COLNAME,
    --- End diff --
    
    I don't think we should change the names of these columns while batching tables.  IMO, the column names in the output table should remain the same as whatever they were in the input table; with the extra column "buffer_id" added.  eg, if they were x and y to begin with, they should remain x and y.  If they were features and labels, they should remain features and labels.
    
    In other words, instead of the literal strings "independent_varname" and "dependent_varname", the names should be self.independent_varname and self.dependent_varname, as specified by the user.
    
    If there is some reason why we want to force the user to always use the same column names in a batched table, then I'd suggest instead calling them either x and y, or independent_var and dependent_var.  Naming them "independent_varname" and "dependent_varname" is problematic for at least two reasons:
    1. These columns contain numeric data, not variable names, so the column heading would not reflect what is in the column.
    2. These column names conflict with the column names in the summary table, which actually do refer to variable names. They also conflict with the names of the parameters passed in by the user. I think users will be very confused if we give two different things exactly the same name in the same function.
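
    Concretely, the first option would just alias the aggregates with the
    user-supplied names instead of the fixed constants. A sketch of the inner
    query only, assuming .format() is also fed independent_varname and
    dependent_varname taken from self:

        SELECT {schema_madlib}.agg_array_concat(
            ARRAY[{norm_tbl}.x_norm::REAL[]]) AS {independent_varname},
            array_agg({norm_tbl}.y) AS {dependent_varname},
            ({norm_tbl}.row_id % {num_of_buffers})::smallint AS buffer_id
        FROM {norm_tbl}
        GROUP BY buffer_id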


---

[GitHub] madlib issue #342: Minibatch Preprocessor for Deep learning

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit commented on the issue:

    https://github.com/apache/madlib/pull/342
  
    
    Refer to this link for build results (access rights to CI server needed): 
    https://builds.apache.org/job/madlib-pr-build/719/



---

[GitHub] madlib issue #342: Minibatch Preprocessor for Deep learning

Posted by reductionista <gi...@git.apache.org>.
Github user reductionista commented on the issue:

    https://github.com/apache/madlib/pull/342
  
    I've added an optional dependent_offset parameter to shift the dependent_var values if desired.  
    
    Also noticed a minor documentation issue in .py_in; fixed to match documentation in .sql_in (and actual behavior).
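
    For anyone skimming: the offset is additive on the packed labels. A
    hypothetical illustration (the parameter name comes from the comment
    above; the value and sign are only an example, see the diff for the
    actual semantics):

        -- with dependent_offset = -1, 1-based class labels pack as 0-based
        SELECT array_agg(label + (-1)) AS shifted_labels
        FROM (VALUES (1), (2), (3)) AS t(label);
        -- result: {0,1,2}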


---

[GitHub] madlib issue #342: Minibatch Preprocessor for Deep learning

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit commented on the issue:

    https://github.com/apache/madlib/pull/342
  
    
    Refer to this link for build results (access rights to CI server needed): 
    https://builds.apache.org/job/madlib-pr-build/729/



---

[GitHub] madlib issue #342: Minibatch Preprocessor for Deep learning

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit commented on the issue:

    https://github.com/apache/madlib/pull/342
  
    
    Refer to this link for build results (access rights to CI server needed): 
    https://builds.apache.org/job/madlib-pr-build/718/



---

[GitHub] madlib issue #342: Minibatch Preprocessor for Deep learning

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit commented on the issue:

    https://github.com/apache/madlib/pull/342
  
    
    Refer to this link for build results (access rights to CI server needed): 
    https://builds.apache.org/job/madlib-pr-build/728/



---

[GitHub] madlib issue #342: Minibatch Preprocessor for Deep learning

Posted by njayaram2 <gi...@git.apache.org>.
Github user njayaram2 commented on the issue:

    https://github.com/apache/madlib/pull/342
  
    @reductionista thank you for the comments.
    The existing `minibatch_preprocessor` module outputs new columns called `dependent_varname` and `independent_varname` instead of the column names from the input table. The reason we did the same here is purely to conform with what is already in the other module. The other module allows expressions as input params (which may have been the reason behind a different column name in its output table), while this module does not explicitly support expressions. So I do agree with your point about the output table column names, but I am just not sure how odd the difference between the two modules would be. Maybe other folks could also weigh in to help us decide. Also, this module (`minibatch_preprocessor_dl`) is in early-stage development, so this is a great time to try out options.
    
    Regarding your comment on the ordering of the two input params (`x` and `y`):
    This is following the convention we have in every other MADlib module, namely, we first have the dependent variable followed by the independent variable in the input parameters list. If you'd like it to be the opposite, it might be a good idea to start a separate thread in the community mailing list to discuss it. It will break conformity if we change the order of the two variables only in this module. BTW, `2.0` release will be a good time to change it since that release would break backward compatibility. 


---

[GitHub] madlib issue #342: Minibatch Preprocessor for Deep learning

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit commented on the issue:

    https://github.com/apache/madlib/pull/342
  
    
    Refer to this link for build results (access rights to CI server needed): 
    https://builds.apache.org/job/madlib-pr-build/717/



---

[GitHub] madlib issue #342: Minibatch Preprocessor for Deep learning

Posted by fmcquillan99 <gi...@git.apache.org>.
Github user fmcquillan99 commented on the issue:

    https://github.com/apache/madlib/pull/342
  
    Associated JIRA: https://issues.apache.org/jira/browse/MADLIB-1290


---