Posted to issues@madlib.apache.org by "Jingyi Mei (JIRA)" <ji...@apache.org> on 2018/03/23 21:56:00 UTC
[jira] [Commented] (MADLIB-1220) Pre-processing helper function for mini-batching - grouping

    [ https://issues.apache.org/jira/browse/MADLIB-1220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16412159#comment-16412159 ] 

Jingyi Mei commented on MADLIB-1220:
------------------------------------

For grouping, we proposed the following implementation:
 # When packing rows from the source table, we compute row_number() partitioned by grouping_cols as row_id, and group by row_id when calling matrix_agg. Rows are therefore packed separately for each group, and each group has its own sequence of row_ids. See this query for details: https://github.com/apache/madlib/blob/master/src/ports/postgres/modules/utilities/minibatch_preprocessing.py_in#L128
 # Apply GROUP BY for standardization. In class MiniBatchStandardizer, instead of computing x_mean_str and x_std_dev_str as array strings for the whole dataset, we call a function that saves the mean and standard deviation arrays for each group in a temp table. We then call madlib.utils_normalize_data to normalize the data by joining the temp table and the source table on the grouping columns.
 # Because the temp table from step 2 contains all the information needed for the standardization output table, we decided to make it a permanent table instead of a temp table. That way, when grouping is used, we don't need another table scan to create the standardization output table. See this line for details: https://github.com/apache/madlib/blob/master/src/ports/postgres/modules/convex/utils_regularization.py_in#L107
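
To illustrate the steps above outside the database, here is a minimal pure-Python sketch. It is not the MADlib implementation: the helper names group_stats and pack_by_group, the row layout, and the sample data are hypothetical. It also assumes that rows are bucketed by integer-dividing the per-group row number by buffer_size, that the standard deviation is the population one, and that a zero standard deviation is replaced by 1.0.

```python
from collections import defaultdict
from math import sqrt


def group_stats(rows, group_col, x_col):
    """Return {group: (mean_vector, std_vector)}, one entry per group.

    Plays the role of the per-group standardization table (hypothetical helper).
    """
    by_group = defaultdict(list)
    for row in rows:
        by_group[row[group_col]].append(row[x_col])
    stats = {}
    for group, vectors in by_group.items():
        n = len(vectors)
        mean = [sum(col) / n for col in zip(*vectors)]
        # Population standard deviation; this choice is an assumption.
        std = [sqrt(sum((v - m) ** 2 for v in col) / n)
               for col, m in zip(zip(*vectors), mean)]
        stats[group] = (mean, std)
    return stats


def pack_by_group(rows, group_col, x_col, y_col, buffer_size, stats):
    """Standardize each row with its own group's stats, then pack per group."""
    # Zero-based stand-in for row_number() partitioned by the grouping column.
    row_counter = defaultdict(int)
    buffers = defaultdict(lambda: ([], []))
    for row in rows:
        group = row[group_col]
        mean, std = stats[group]
        # A zero std dev is replaced by 1.0 to avoid division by zero (assumption).
        x = [(v - m) / (s or 1.0) for v, m, s in zip(row[x_col], mean, std)]
        # Assumed bucketing rule: per-group row number divided by buffer_size.
        buffer_id = row_counter[group] // buffer_size
        row_counter[group] += 1
        xs, ys = buffers[(group, buffer_id)]
        xs.append(x)
        ys.append(row[y_col])
    return dict(buffers)


# Hypothetical sample data: two groups, one independent variable.
rows = [
    {"grp": "a", "x": [1.0], "y": 0.0},
    {"grp": "a", "x": [3.0], "y": 1.0},
    {"grp": "a", "x": [5.0], "y": 0.0},
    {"grp": "b", "x": [10.0], "y": 1.0},
    {"grp": "b", "x": [20.0], "y": 0.0},
]
stats = group_stats(rows, "grp", "x")
packed = pack_by_group(rows, "grp", "x", "y", buffer_size=2, stats=stats)
```

Because stats is computed once and reused for packing, it can double as the standardization output without a second pass over the rows, which is the motivation given in step 3.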

> Pre-processing helper function for mini-batching - grouping 
> ------------------------------------------------------------
>
>                 Key: MADLIB-1220
>                 URL: https://issues.apache.org/jira/browse/MADLIB-1220
>             Project: Apache MADlib
>          Issue Type: New Feature
>          Components: Module: Utilities
>            Reporter: Nikhil
>            Assignee: Nikhil
>            Priority: Major
>             Fix For: v1.14
>
>
> Related to
>  https://issues.apache.org/jira/browse/MADLIB-1200
> Story
> {{As a}}
>  data scientist
>  {{I want to}}
>  add grouping to mini-batch pre-process
>  {{so that}}
>  I can handle groups with a single operation.
> Interface
> {code:java}
> minibatch_preprocessor(
>     source_table,        -- Name of the table containing input data
>     output_table,        -- Name of the output table for mini-batching
>     dependent_varname,   -- Name of the dependent variable column
>     independent_varname, -- Expression list to evaluate for the independent variables
>     grouping_cols,       -- Preprocess separately by group
>     buffer_size          -- Number of source input rows to pack into batch
> )
> {code}
> where
> {code:java}
> source_table
> TEXT.  Name of the table containing input data.  Can also be a view.
> output_table
> TEXT.  Name of the output table from the preprocessor which will be used as input to algorithms that support mini-batching.
> dependent_varname
> TEXT.  Column name or expression to evaluate for the dependent variable. 
> independent_varname
> TEXT.  Column name or expression list to evaluate for the independent variable.  Will be cast to double when packing.
> grouping_cols (optional)
> TEXT, default: NULL.  An expression list used to group the input dataset into discrete groups, running one preprocessing step per group. Similar to the SQL GROUP BY clause. When this value is NULL, no grouping is used and a single preprocessing step is performed for the whole data set.
> buffer_size (optional)
> INTEGER, default: ???.  Number of source input rows to pack into batch.
> {code}
> The output table contains the following columns:
> {code:java}
> id                   INTEGER.   Unique id for packed table.
> dependent_varname    FLOAT8[].  Packed array of dependent variables.
> independent_varname  FLOAT8[].  Packed array of independent variables.
> grouping_cols        TEXT.      Name of grouping columns.
> {code}
> A summary table named <output_table>_summary is created together with the output table. It has the following columns:
> {code:java}
> source_table              Source table name.
> output_table              Output table name from preprocessor.
> dependent_varname         Dependent variable.
> independent_varname       Independent variables.
> buffer_size               Buffer size used in preprocessing step.
> dependent_vartype         “Continuous” or “Categorical”
> class_values              Class values of the dependent variable (NULL for continuous vars).
> num_rows_processed        The total number of rows that were used in the computation.
> num_missing_rows_skipped  The total number of rows that were skipped because of NULL values in them.
> grouping_cols             Names of the grouping columns.
> {code}
> A standardization table named <output_table>_standardization is created together with the output table. It has the following columns:
> {code:java}
> <grouping_col_expression>  Group
> mean                       Mean of independent vars by group
> std                        Standard deviation of independent vars by group
> {code}
>  
>  Acceptance



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)