
Posted to commits@madlib.apache.org by nk...@apache.org on 2021/02/09 21:06:47 UTC

[madlib] branch master updated (f29674b -> a0f711c)

This is an automated email from the ASF dual-hosted git repository.

nkak pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/madlib.git.


    from f29674b  move notes to bottom of page for consistency in user docs
     new b00750b  DL: remove unused rotate import
     new bdc67ec  DL: Fix validation in fit, fit multiple, evaluate and predict
     new fe42e7f  DL: Fix misc bugs
     new a0f711c  DL: Cleanup fit and fit_multiple

The 4 revisions listed above as "new" are entirely new to this
repository and will be described in separate emails.  The revisions
listed as "add" were already present in the repository and have only
been added to this reference.


Summary of changes:
 .../modules/deep_learning/madlib_keras.py_in       | 162 ++++++--------
 .../madlib_keras_fit_multiple_model.py_in          | 108 +++-------
 .../deep_learning/madlib_keras_predict.py_in       |   7 +-
 .../deep_learning/madlib_keras_validator.py_in     | 237 ++++++++++-----------
 .../test/madlib_keras_evaluate.sql_in              |   9 +
 .../deep_learning/test/madlib_keras_fit.sql_in     |  66 ++++++
 .../test/madlib_keras_model_selection.sql_in       |  37 ++++
 .../test/madlib_keras_multi_io.sql_in              |  25 +++
 .../deep_learning/test/madlib_keras_predict.sql_in |  20 ++
 .../test/madlib_keras_predict_byom.sql_in          |  27 +++
 .../test/unit_tests/test_madlib_keras.py_in        |  84 +++++---
 .../postgres/modules/utilities/utilities.sql_in    |  26 +++
 12 files changed, 481 insertions(+), 327 deletions(-)

[madlib] 02/04: DL: Fix validation in fit, fit multiple, evaluate and predict

Posted by nk...@apache.org.

nkak pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/madlib.git

commit bdc67ec12f0263deaac5e2728f0c01521bd3b9ea
Author: Nikhil Kak <nk...@vmware.com>
AuthorDate: Fri Jan 22 16:43:01 2021 -0800

    DL: Fix validation in fit, fit multiple, evaluate and predict
    
    JIRA: MADLIB-1464
    
    Previously, calling fit/fit_multiple/evaluate/predict with invalid
    input or output tables (null or missing) printed the wrong error
    message. This commit refactors the code so that the expected error
    message is printed.
    
    Refactored the validator code so that the fit multiple class no longer
    needs to create the info and summary table names. The validator now
    creates them, and the validator object can be used to get the table
    names, which makes it easier to validate all the tables inside the
    validator class. This commit also moves all the validation code into
    the validator class, except for the source table validation: the
    source table must be validated before we call
    get_data_distribution_per_segment, which in turn has to be called
    before the validator constructor.
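
The idea in the paragraph above can be sketched roughly as follows. This is a simplified, database-free illustration, not the MADlib implementation: `SketchFitValidator`, the `existing_tables` set, and the error strings are hypothetical stand-ins, and `add_postfix` here is a plain string concatenation rather than the real MADlib helper.

```python
def add_postfix(table_name, postfix):
    # Simplified stand-in (assumption): the real helper also handles quoting.
    return table_name + postfix


class SketchFitValidator(object):
    """Hypothetical validator that derives and checks all table names itself."""

    def __init__(self, source_table, output_model_table, existing_tables):
        self.source_table = source_table
        self.output_model_table = output_model_table
        # Derived names live on the validator so callers can read them back.
        self.source_summary_table = (
            add_postfix(source_table, "_summary") if source_table else None)
        self.output_summary_model_table = (
            add_postfix(output_model_table, "_summary")
            if output_model_table else None)
        self._validate_tables(existing_tables)

    def _validate_tables(self, existing_tables):
        # Input tables must be named and must already exist.
        for tbl in (self.source_table, self.source_summary_table):
            if not tbl:
                raise ValueError("sketch: input table name is NULL or empty")
            if tbl not in existing_tables:
                raise ValueError(
                    "sketch: input table '{0}' does not exist".format(tbl))
        # Output tables must be named and must not exist yet.
        for tbl in (self.output_model_table, self.output_summary_model_table):
            if not tbl:
                raise ValueError("sketch: output table name is NULL or empty")
            if tbl in existing_tables:
                raise ValueError(
                    "sketch: output table '{0}' already exists".format(tbl))


validator = SketchFitValidator("train", "model", {"train", "train_summary"})
print(validator.output_summary_model_table)
```

Because the caller reads the derived names from the validator object, a null or missing table is reported by the validation step rather than by a failure while building a name from `None`.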
    
    To test this, we created a plpython function that asserts that a query
    failed with the expected error message, and added a couple of wrapper
    functions on top of it that test for null input and output tables.
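
The testing approach described above might look something like the following in plain Python. This is only a sketch under stated assumptions: the actual helper is a plpython function in utilities.sql_in that runs a SQL query, whereas `assert_fails_with`, `assert_null_input_fails`, and `fake_fit` below are hypothetical names that operate on ordinary Python callables.

```python
def assert_fails_with(func, expected_msg):
    """Assert that func() raises and that the error text contains expected_msg."""
    try:
        func()
    except Exception as exc:
        if expected_msg not in str(exc):
            raise AssertionError(
                "Expected error containing '{0}', got '{1}'".format(
                    expected_msg, exc))
        return str(exc)
    raise AssertionError(
        "Expected failure containing '{0}', but the call succeeded".format(
            expected_msg))


def assert_null_input_fails(func, expected_msg):
    """Hypothetical wrapper: call func with a null input table name."""
    return assert_fails_with(lambda: func(None, "out_tbl"), expected_msg)


def fake_fit(source_table, output_table):
    # Stand-in for a fit-like call; the message text is an assumption.
    if source_table is None:
        raise ValueError("fake_fit: input table name is NULL")
    return "ok"


message = assert_null_input_fails(fake_fit, "input table name is NULL")
print(message)
```

Checking the message text, and not just that the call failed, is what catches the bug this commit addresses: a call that fails with the wrong error would still pass a plain "did it raise" check.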
    
    Co-authored-by: Ekta Khanna <ek...@vmware.com>
---
 .../modules/deep_learning/madlib_keras.py_in       |  63 ++----
 .../madlib_keras_fit_multiple_model.py_in          |  82 +++-----
 .../deep_learning/madlib_keras_predict.py_in       |   3 +-
 .../deep_learning/madlib_keras_validator.py_in     | 222 ++++++++++-----------
 .../test/madlib_keras_evaluate.sql_in              |   9 +
 .../deep_learning/test/madlib_keras_fit.sql_in     |  42 ++++
 .../test/madlib_keras_model_selection.sql_in       |  37 ++++
 .../test/madlib_keras_multi_io.sql_in              |  25 +++
 .../deep_learning/test/madlib_keras_predict.sql_in |  20 ++
 .../test/madlib_keras_predict_byom.sql_in          |  27 +++
 .../test/unit_tests/test_madlib_keras.py_in        |  33 ++-
 .../postgres/modules/utilities/utilities.sql_in    |  26 +++
 12 files changed, 355 insertions(+), 234 deletions(-)

diff --git a/src/ports/postgres/modules/deep_learning/madlib_keras.py_in b/src/ports/postgres/modules/deep_learning/madlib_keras.py_in
index 49892b6..c4f8611 100644
--- a/src/ports/postgres/modules/deep_learning/madlib_keras.py_in
+++ b/src/ports/postgres/modules/deep_learning/madlib_keras.py_in
@@ -103,6 +103,7 @@ def fit(schema_madlib, source_table, model, model_arch_table,
     fit_params = "" if not fit_params else fit_params
     _assert(compile_params, "Compile parameters cannot be empty or NULL.")
 
+    input_tbl_valid(source_table, module_name)
     segments_per_host = get_data_distribution_per_segment(source_table)
     use_gpus = use_gpus if use_gpus else False
     if use_gpus:
@@ -114,51 +115,27 @@ def fit(schema_madlib, source_table, model, model_arch_table,
 
     if object_table is not None:
         object_table = "{0}.{1}".format(schema_madlib, quote_ident(object_table))
-
-    source_summary_table = add_postfix(source_table, "_summary")
-    input_tbl_valid(source_summary_table, module_name)
-    src_summary_dict = get_source_summary_table_dict(source_summary_table)
-
-    columns_dict = {}
-    columns_dict['mb_dep_var_cols'] = src_summary_dict['dependent_varname']
-    columns_dict['mb_indep_var_cols'] = src_summary_dict['independent_varname']
-    columns_dict['dep_shape_cols'] = [add_postfix(i, "_shape") for i in columns_dict['mb_dep_var_cols']]
-    columns_dict['ind_shape_cols'] = [add_postfix(i, "_shape") for i in columns_dict['mb_indep_var_cols']]
-
-    multi_dep_count = len(columns_dict['mb_dep_var_cols'])
-    val_dep_var = None
-    val_ind_var = None
-
-    val_dep_shape_cols = None
-    val_ind_shape_cols = None
-    if validation_table:
-        validation_summary_table = add_postfix(validation_table, "_summary")
-        input_tbl_valid(validation_summary_table, module_name)
-        val_summary_dict = get_source_summary_table_dict(validation_summary_table)
-
-        val_dep_var = val_summary_dict['dependent_varname']
-        val_ind_var = val_summary_dict['independent_varname']
-        val_dep_shape_cols = [add_postfix(i, "_shape") for i in val_dep_var]
-        val_ind_shape_cols = [add_postfix(i, "_shape") for i in val_ind_var]
-
     fit_validator = FitInputValidator(
         source_table, validation_table, model, model_arch_table, model_id,
-        columns_dict['mb_dep_var_cols'], columns_dict['mb_indep_var_cols'],
-        columns_dict['dep_shape_cols'], columns_dict['ind_shape_cols'],
         num_iterations, metrics_compute_frequency, warm_start,
-        use_gpus, accessible_gpus_for_seg, object_table,
-        val_dep_var, val_ind_var)
-
-    columns_dict['val_dep_var'] = val_dep_var
-    columns_dict['val_ind_var'] = val_ind_var
-    columns_dict['val_dep_shape_cols'] = val_dep_shape_cols
-    columns_dict['val_ind_shape_cols'] = val_ind_shape_cols
-
-    fit_validator.dependent_varname = columns_dict['mb_dep_var_cols']
-    fit_validator.independent_varname = columns_dict['mb_indep_var_cols']
-    fit_validator.dep_shape_col = columns_dict['dep_shape_cols']
-    fit_validator.ind_shape_col = columns_dict['ind_shape_cols']
+        use_gpus, accessible_gpus_for_seg, object_table)
 
+    columns_dict = {}
+    columns_dict['mb_dep_var_cols'] = fit_validator.dependent_varname
+    columns_dict['mb_indep_var_cols'] = fit_validator.independent_varname
+    columns_dict['dep_shape_cols'] = fit_validator.dep_shape_cols
+    columns_dict['ind_shape_cols'] = fit_validator.ind_shape_cols
+    columns_dict['val_dep_var'] = fit_validator.val_dep_var
+    columns_dict['val_ind_var'] = fit_validator.val_ind_var
+    columns_dict['val_dep_shape_cols'] = fit_validator.val_dep_shape_cols
+    columns_dict['val_ind_shape_cols'] = fit_validator.val_ind_shape_cols
+    multi_dep_count = len(fit_validator.dependent_varname)
+
+    # fit_validator.dependent_varname = columns_dict['mb_dep_var_cols']
+    # fit_validator.independent_varname = columns_dict['mb_indep_var_cols']
+    # fit_validator.dep_shape_col = columns_dict['dep_shape_cols']
+    # fit_validator.ind_shape_col = columns_dict['ind_shape_cols']
+    src_summary_dict = fit_validator.src_summary_dict
     class_values_colnames = [add_postfix(i, "_class_values") for i in columns_dict['mb_dep_var_cols']]
     src_summary_dict['class_values_type'] =[ get_expr_type(
         i, fit_validator.source_summary_table) for i in class_values_colnames]
@@ -446,6 +423,7 @@ def fit(schema_madlib, source_table, model, model_arch_table,
                    normalizing_const_colname=NORMALIZING_CONST_COLNAME,
                    FLOAT32_SQL_TYPE = FLOAT32_SQL_TYPE,
                    model_id_colname = ModelArchSchema.MODEL_ID,
+                   source_summary_table=fit_validator.source_summary_table,
                    **locals()),
                    ["TEXT", "TEXT", "TEXT", "TEXT", "DOUBLE PRECISION[]"])
     plpy.execute(create_output_summary_table,
@@ -867,6 +845,7 @@ def evaluate(schema_madlib, model_table, test_table, output_table,
 
     module_name = 'madlib_keras_evaluate'
     is_mult_model = mst_key is not None
+    test_summary_table = None
     if test_table:
         test_summary_table = add_postfix(test_table, "_summary")
     model_summary_table = None
@@ -874,6 +853,7 @@ def evaluate(schema_madlib, model_table, test_table, output_table,
         model_summary_table = add_postfix(model_table, "_summary")
 
     mult_where_clause = ""
+    input_tbl_valid(model_table, module_name)
     if is_mult_model:
         mult_where_clause = "WHERE mst_key = {0}".format(mst_key)
         model_summary_table = create_summary_view(module_name, model_table, mst_key)
@@ -1035,7 +1015,6 @@ def get_loss_metric_from_keras_eval(schema_madlib, table, columns_dict, compile_
         weights = '$1'
         mult_sql = ''
         custom_map_var = '$2'
-        plpy.info(eval_sql.format(**locals()))
         evaluate_query = plpy.prepare(eval_sql.format(**locals()), ["bytea", "bytea"])
         res = plpy.execute(evaluate_query, [serialized_weights, custom_function_map])
 
diff --git a/src/ports/postgres/modules/deep_learning/madlib_keras_fit_multiple_model.py_in b/src/ports/postgres/modules/deep_learning/madlib_keras_fit_multiple_model.py_in
index deda8f6..22b9401 100644
--- a/src/ports/postgres/modules/deep_learning/madlib_keras_fit_multiple_model.py_in
+++ b/src/ports/postgres/modules/deep_learning/madlib_keras_fit_multiple_model.py_in
@@ -45,6 +45,7 @@ from utilities.utilities import unique_string
 from utilities.utilities import madlib_version
 from utilities.utilities import is_platform_pg
 from utilities.utilities import get_seg_number
+from utilities.validate_args import input_tbl_valid
 import utilities.debug as DEBUG
 from utilities.debug import plpy_prepare
 from utilities.debug import plpy_execute
@@ -110,8 +111,6 @@ class FitMultipleModel(object):
         self.source_table = source_table
         self.validation_table = validation_table
         self.model_selection_table = model_selection_table
-        if self.model_selection_table:
-            self.model_selection_summary_table = add_postfix(self.model_selection_table, '_summary')
 
         self.dist_key_col = DISTRIBUTION_KEY_COLNAME
         self.prev_dist_key_col = '__prev_dist_key__'
@@ -134,40 +133,6 @@ class FitMultipleModel(object):
         self.train_mst_loss = defaultdict(list)
         self.train_mst_metric = defaultdict(list)
         self.info_str = ""
-        source_summary_table = add_postfix(self.source_table, "_summary")
-        input_tbl_valid(source_summary_table, self.module_name)
-        src_summary_dict = get_source_summary_table_dict(source_summary_table)
-
-        self.mb_dep_var_cols = src_summary_dict['dependent_varname']
-        self.mb_indep_var_cols = src_summary_dict['independent_varname']
-        self.dep_shape_cols = [add_postfix(i, "_shape") for i in self.mb_dep_var_cols]
-        self.ind_shape_cols = [add_postfix(i, "_shape") for i in self.mb_indep_var_cols]
-
-        self.columns_dict = {}
-        self.columns_dict['mb_dep_var_cols'] = self.mb_dep_var_cols
-        self.columns_dict['mb_indep_var_cols'] = self.mb_indep_var_cols
-        self.columns_dict['dep_shape_cols'] = self.dep_shape_cols
-        self.columns_dict['ind_shape_cols'] = self.ind_shape_cols
-
-        self.val_dep_var = None
-        self.val_ind_var = None
-        self.val_dep_shape_cols = None
-        self.val_ind_shape_cols = None
-        if validation_table:
-            validation_summary_table = add_postfix(self.validation_table, "_summary")
-            input_tbl_valid(validation_summary_table, self.module_name)
-            val_summary_dict = get_source_summary_table_dict(validation_summary_table)
-
-            self.val_dep_var = val_summary_dict['dependent_varname']
-            self.val_ind_var = val_summary_dict['independent_varname']
-            self.val_dep_shape_cols = [add_postfix(i, "_shape") for i in self.val_dep_var]
-            self.val_ind_shape_cols = [add_postfix(i, "_shape") for i in self.val_ind_var]
-
-        self.columns_dict['val_dep_var'] = self.val_dep_var
-        self.columns_dict['val_ind_var'] = self.val_ind_var
-        self.columns_dict['val_dep_shape_cols'] = self.val_dep_shape_cols
-        self.columns_dict['val_ind_shape_cols'] = self.val_ind_shape_cols
-
         self.use_gpus = use_gpus if use_gpus else False
         self.model_input_tbl = unique_string('model_input')
         self.model_output_tbl = unique_string('model_output')
@@ -178,6 +143,7 @@ class FitMultipleModel(object):
         self.rotate_schedule_tbl_plan = self.add_object_maps_plan = None
         self.hop_plan = self.udf_plan = None
 
+        input_tbl_valid(self.source_table, self.module_name)
         self.segments_per_host = get_data_distribution_per_segment(source_table)
         if self.use_gpus:
             self.accessible_gpus_for_seg = get_accessible_gpus_for_seg(
@@ -186,30 +152,32 @@ class FitMultipleModel(object):
             self.accessible_gpus_for_seg = get_seg_number()*[0]
 
         self.original_model_output_tbl = model_output_table
-        if not self.original_model_output_tbl:
-            plpy.error("Must specify an output table.")
-
-        self.model_info_tbl = add_postfix(
-            self.original_model_output_tbl, '_info')
-        self.model_summary_table = add_postfix(
-            self.original_model_output_tbl, '_summary')
-
         self.warm_start = bool(warm_start)
 
         self.fit_validator_train = FitMultipleInputValidator(
             self.source_table, self.validation_table, self.original_model_output_tbl,
-            self.model_selection_table, self.model_selection_summary_table,
-            self.mb_dep_var_cols, self.mb_indep_var_cols, self.dep_shape_cols,
-            self.ind_shape_cols, self.num_iterations,
-            self.model_info_tbl, self.mst_key_col, self.model_arch_table_col,
-            self.metrics_compute_frequency, self.warm_start, self.use_gpus,
-            self.accessible_gpus_for_seg, self.val_dep_var, self.val_ind_var)
+            self.model_selection_table, self.num_iterations, self.mst_key_col,
+            self.model_arch_table_col, self.metrics_compute_frequency,
+            self.warm_start, self.use_gpus, self.accessible_gpus_for_seg)
+        self.model_info_tbl = self.fit_validator_train.output_model_info_table
+        self.model_summary_table = self.fit_validator_train.output_summary_model_table
+        self.model_selection_summary_table = self.fit_validator_train.model_selection_summary_table
         if self.metrics_compute_frequency is None:
             self.metrics_compute_frequency = num_iterations
 
         self.msts = self.fit_validator_train.msts
         self.model_arch_table = self.fit_validator_train.model_arch_table
         self.object_table = self.fit_validator_train.object_table
+        self.columns_dict = {}
+        self.columns_dict['mb_dep_var_cols'] = self.fit_validator_train.dependent_varname
+        self.columns_dict['mb_indep_var_cols'] = self.fit_validator_train.independent_varname
+        self.columns_dict['dep_shape_cols'] = self.fit_validator_train.dep_shape_cols
+        self.columns_dict['ind_shape_cols'] = self.fit_validator_train.ind_shape_cols
+        self.columns_dict['val_dep_var'] = self.fit_validator_train.val_dep_var
+        self.columns_dict['val_ind_var'] = self.fit_validator_train.val_ind_var
+        self.columns_dict['val_dep_shape_cols'] = self.fit_validator_train.val_dep_shape_cols
+        self.columns_dict['val_ind_shape_cols'] = self.fit_validator_train.val_ind_shape_cols
+
         self.metrics_iters = []
         self.object_map_col = 'object_map'
         self.custom_mst_keys = None
@@ -222,7 +190,7 @@ class FitMultipleModel(object):
 
         self.dist_key_mapping, self.images_per_seg_train = \
             get_image_count_per_seg_for_minibatched_data_from_db(
-                self.source_table, self.dep_shape_cols[0])
+                self.source_table, self.fit_validator_train.dep_shape_cols[0])
 
         if self.validation_table:
             self.valid_mst_metric_eval_time = defaultdict(list)
@@ -230,7 +198,7 @@ class FitMultipleModel(object):
             self.valid_mst_metric = defaultdict(list)
             self.dist_key_mapping_valid, self.images_per_seg_valid = \
                 get_image_count_per_seg_for_minibatched_data_from_db(
-                    self.validation_table, self.val_dep_shape_cols[0])
+                    self.validation_table, self.fit_validator_train.val_dep_shape_cols[0])
 
         self.dist_keys = query_dist_keys(self.source_table, self.dist_key_col)
         self.max_dist_key = sorted(self.dist_keys)[-1]
@@ -713,7 +681,7 @@ class FitMultipleModel(object):
         source_summary_table = self.fit_validator_train.source_summary_table
         src_summary_dict = get_source_summary_table_dict(source_summary_table)
 
-        class_values_colnames = [add_postfix(i, "_class_values") for i in self.mb_dep_var_cols]
+        class_values_colnames = [add_postfix(i, "_class_values") for i in self.fit_validator_train.dependent_varname]
         # class_values = src_summary_dict['class_values']
         class_values_type =[get_expr_type(i, source_summary_table) for i in class_values_colnames]
         # class_values_type = src_summary_dict['class_values_type']
@@ -897,10 +865,10 @@ class FitMultipleModel(object):
             """.format(self=self))
 
         #TODO: Fix these to add multi io
-        dep_shape_col = self.dep_shape_cols[0]
-        ind_shape_col = self.ind_shape_cols[0]
-        dep_var_col = self.mb_dep_var_cols[0]
-        indep_var_col = self.mb_indep_var_cols[0]
+        dep_shape_col = self.fit_validator_train.dep_shape_cols[0]
+        ind_shape_col = self.fit_validator_train.ind_shape_cols[0]
+        dep_var_col = self.fit_validator_train.dependent_varname[0]
+        indep_var_col = self.fit_validator_train.independent_varname[0]
         source_table = self.source_table
 
         if self.use_caching:
diff --git a/src/ports/postgres/modules/deep_learning/madlib_keras_predict.py_in b/src/ports/postgres/modules/deep_learning/madlib_keras_predict.py_in
index 053a5f9..0e5b1b9 100644
--- a/src/ports/postgres/modules/deep_learning/madlib_keras_predict.py_in
+++ b/src/ports/postgres/modules/deep_learning/madlib_keras_predict.py_in
@@ -55,6 +55,8 @@ class BasePredict():
         self.module_name = module_name
 
         self.use_gpus = use_gpus if use_gpus else False
+        input_tbl_valid(test_table, module_name)
+        input_tbl_valid(table_to_validate, module_name)
         self.segments_per_host = get_data_distribution_per_segment(test_table)
         if self.use_gpus:
             accessible_gpus_for_seg = get_accessible_gpus_for_seg(schema_madlib,
@@ -252,7 +254,6 @@ class Predict(BasePredict):
             plpy.execute("DROP VIEW IF EXISTS {0}".format(self.temp_summary_view))
 
     def validate(self):
-        input_tbl_valid(self.model_table, self.module_name)
         if self.is_mult_model and not columns_exist_in_table(self.model_table, ['mst_key']):
             plpy.error("{self.module_name}: Single model should not pass mst_key".format(**locals()))
         if not self.is_mult_model and columns_exist_in_table(self.model_table, ['mst_key']):
diff --git a/src/ports/postgres/modules/deep_learning/madlib_keras_validator.py_in b/src/ports/postgres/modules/deep_learning/madlib_keras_validator.py_in
index 2549b84..21eff15 100644
--- a/src/ports/postgres/modules/deep_learning/madlib_keras_validator.py_in
+++ b/src/ports/postgres/modules/deep_learning/madlib_keras_validator.py_in
@@ -260,20 +260,12 @@ class InputValidator:
 
 class FitCommonValidator(object):
     def __init__(self, source_table, validation_table, output_model_table,
-                 model_arch_table, model_id, dependent_varname,
-                 independent_varname, dep_shape_cols, ind_shape_cols, num_iterations,
-                 metrics_compute_frequency, warm_start,
-                 use_gpus, accessible_gpus_for_seg, module_name, object_table,
-                 val_dep_var, val_ind_var):
+                 num_iterations, metrics_compute_frequency, warm_start,
+                 use_gpus, accessible_gpus_for_seg, module_name, object_table):
         self.source_table = source_table
         self.validation_table = validation_table
         self.output_model_table = output_model_table
-        self.model_arch_table = model_arch_table
-        self.model_id = model_id
-        self.dependent_varname = dependent_varname
-        self.independent_varname = independent_varname
-        self.dep_shape_cols = dep_shape_cols
-        self.ind_shape_cols = ind_shape_cols
+
         self.metrics_compute_frequency = metrics_compute_frequency
         self.warm_start = warm_start
         self.num_iterations = num_iterations
@@ -282,42 +274,52 @@ class FitCommonValidator(object):
         if self.source_table:
             self.source_summary_table = add_postfix(
                 self.source_table, "_summary")
+        if self.validation_table:
+            self.validation_summary_table = add_postfix(
+                self.validation_table, "_summary")
         if self.output_model_table:
             self.output_summary_model_table = add_postfix(
                 self.output_model_table, "_summary")
         self.accessible_gpus_for_seg = accessible_gpus_for_seg
         self.module_name = module_name
-        self.val_dep_var = val_dep_var
-        self.val_ind_var = val_ind_var
 
-        self._validate_common_args()
+        self._validate_tables()
+
+        self.src_summary_dict = self.get_source_summary_table_dict(self.source_summary_table)
+
+        self.dependent_varname = self.src_summary_dict['dependent_varname']
+        self.independent_varname = self.src_summary_dict['independent_varname']
+        self.dep_shape_cols = [add_postfix(i, "_shape") for i in self.dependent_varname]
+        self.ind_shape_cols = [add_postfix(i, "_shape") for i in self.independent_varname]
+
+        self.val_dep_var = None
+        self.val_ind_var = None
+        self.val_dep_shape_cols = None
+        self.val_ind_shape_cols = None
+        if self.validation_table:
+            val_summary_dict = self.get_source_summary_table_dict(self.validation_summary_table)
+
+            self.val_dep_var = val_summary_dict['dependent_varname']
+            self.val_ind_var = val_summary_dict['independent_varname']
+            self.val_dep_shape_cols = [add_postfix(i, "_shape") for i in self.val_dep_var]
+            self.val_ind_shape_cols = [add_postfix(i, "_shape") for i in self.val_ind_var]
+
+        self._validate_tables_schema()
         if use_gpus:
             InputValidator._validate_gpu_config(self.module_name,
                 self.source_table, self.accessible_gpus_for_seg)
 
-    def _validate_common_args(self):
-        _assert(self.num_iterations > 0,
-            "{0}: Number of iterations cannot be < 1.".format(self.module_name))
-        _assert(self._is_valid_metrics_compute_frequency(),
-            "{0}: metrics_compute_frequency must be in the range (1 - {1}).".format(
-                self.module_name, self.num_iterations))
+    def _validate_tables(self):
         input_tbl_valid(self.source_table, self.module_name)
+        input_tbl_valid(self.source_summary_table, self.module_name)
+        if self.validation_table:
+            input_tbl_valid(self.validation_table, self.module_name)
+            input_tbl_valid(self.validation_summary_table, self.module_name)
+
         if self.object_table is not None:
             input_tbl_valid(self.object_table, self.module_name)
             cols_in_tbl_valid(self.object_table, CustomFunctionSchema.col_names, self.module_name)
 
-        cols_in_tbl_valid(self.source_summary_table,
-            [NORMALIZING_CONST_COLNAME, DEPENDENT_VARTYPE_COLNAME,
-            'dependent_varname', 'independent_varname'], self.module_name)
-        if not is_platform_pg():
-            cols_in_tbl_valid(self.source_table, [DISTRIBUTION_KEY_COLNAME], self.module_name)
-
-        # Source table and validation tables must have the same schema
-        self._validate_input_table(self.source_table)
-        for i in self.dependent_varname:
-            validate_bytea_var_for_minibatch(self.source_table, i)
-
-        self._validate_validation_table()
         if self.warm_start:
             input_tbl_valid(self.output_model_table, self.module_name)
             input_tbl_valid(self.output_summary_model_table, self.module_name)
@@ -325,60 +327,59 @@ class FitCommonValidator(object):
             output_tbl_valid(self.output_model_table, self.module_name)
             output_tbl_valid(self.output_summary_model_table, self.module_name)
 
-    def _validate_input_table(self, table, is_validation_table=False):
-
-        independent_varname = self.val_ind_var if is_validation_table else self.independent_varname
-        dependent_varname = self.val_dep_var if is_validation_table else self.dependent_varname
-
-        for name in independent_varname:
-            _assert(is_var_valid(table, name),
-                "{module_name}: invalid independent_varname "
-                "('{independent_varname}') for table ({table}). "
-                "Please ensure that the input table ({table}) "
-                "has been preprocessed by the image preprocessor.".format(
-                    module_name=self.module_name,
-                    independent_varname=name,
-                    table=table))
-
-        for name in dependent_varname:
-            _assert(is_var_valid(table, name),
-                "{module_name}: invalid dependent_varname "
-                "('{dependent_varname}') for table ({table}). "
-                "Please ensure that the input table ({table}) "
-                "has been preprocessed by the image preprocessor.".format(
-                    module_name=self.module_name,
-                    dependent_varname=name,
-                    table=table))
-        if not is_validation_table:
-            for name in self.ind_shape_cols:
-                _assert(is_var_valid(table, name),
-                    "{module_name}: invalid independent_var_shape "
-                    "('{ind_shape_col}') for table ({table}). "
-                    "Please ensure that the input table ({table}) "
-                    "has been preprocessed by the image preprocessor.".format(
-                        module_name=self.module_name,
-                        ind_shape_col=name,
-                        table=table))
-
-            for name in self.dep_shape_cols:
-                _assert(is_var_valid(table, name),
-                    "{module_name}: invalid dependent_var_shape "
-                    "('{dep_shape_col}') for table ({table}). "
-                    "Please ensure that the input table ({table}) "
-                    "has been preprocessed by the image preprocessor.".format(
-                        module_name=self.module_name,
-                        dep_shape_col=name,
-                        table=table))
 
+    def _validate_tables_schema(self):
+        # Source table and validation tables must have the same schema
+        additional_cols = []
         if not is_platform_pg():
-            _assert(is_var_valid(table, DISTRIBUTION_KEY_COLNAME),
-                    "{module_name}: missing distribution key "
-                    "('{dist_key_col}') for table ({table}). "
-                    "Please ensure that the input table ({table}) "
-                    "has been preprocessed by the image preprocessor.".format(
+            additional_cols.append(DISTRIBUTION_KEY_COLNAME)
+
+        self._validate_columns_in_preprocessed_table(self.source_table,
+                                                    self.independent_varname +
+                                                    self.dependent_varname +
+                                                    self.ind_shape_cols +
+                                                    self.dep_shape_cols +
+                                                    additional_cols)
+        for i in self.dependent_varname:
+            validate_bytea_var_for_minibatch(self.source_table, i)
+
+        if self.validation_table and self.validation_table.strip() != '':
+            self._validate_columns_in_preprocessed_table(self.validation_table,
+                                                        self.val_ind_var +
+                                                        self.val_dep_var +
+                                                        self.val_ind_shape_cols +
+                                                        self.val_dep_shape_cols+
+                                                        additional_cols)
+            for i in self.val_dep_var:
+                validate_bytea_var_for_minibatch(self.validation_table, i)
+
+        cols_in_tbl_valid(self.source_summary_table,
+                          [NORMALIZING_CONST_COLNAME, DEPENDENT_VARTYPE_COLNAME,
+                           'dependent_varname', 'independent_varname'], self.module_name)
+
+    def _validate_misc_args(self):
+        _assert(self.num_iterations > 0,
+                "{0}: Number of iterations cannot be < 1.".format(self.module_name))
+        _assert(self._is_valid_metrics_compute_frequency(),
+                "{0}: metrics_compute_frequency must be in the range (1 - {1}).".format(
+                    self.module_name, self.num_iterations))
+
+    def get_source_summary_table_dict(self, source_summary_table):
+        source_summary = plpy.execute("""
+                SELECT *
+                FROM {0}
+            """.format(source_summary_table))[0]
+        return source_summary
+
+    def _validate_columns_in_preprocessed_table(self, table_name, col_names):
+        for col in col_names:
+            _assert(is_var_valid(table_name, col),
+                    "{module_name}: invalid column name "
+                    "('{col}') for table ({table_name}). "
+                    "Please ensure that the input table ({table_name}) "
+                    "has been preprocessed.".format(
                         module_name=self.module_name,
-                        dist_key_col=DISTRIBUTION_KEY_COLNAME,
-                        table=table))
+                        **locals()))
 
     def _is_valid_metrics_compute_frequency(self):
         return self.metrics_compute_frequency is None or \
@@ -389,6 +390,8 @@ class FitCommonValidator(object):
         if self.validation_table and self.validation_table.strip() != '':
             input_tbl_valid(self.validation_table, self.module_name)
             self._validate_input_table(self.validation_table, True)
+            validation_summary_table = add_postfix(self.validation_table, "_summary")
+            input_tbl_valid(validation_summary_table, self.module_name)
             for i in self.val_dep_var:
                 dependent_vartype = get_expr_type(i,
                                                   self.validation_table)
@@ -403,71 +406,53 @@ class FitCommonValidator(object):
                                input_shape, 2, True)
         if self.validation_table:
             InputValidator.validate_input_shape(
-                self.validation_table, self.independent_varname,
+                self.validation_table,  self.independent_varname,
                 input_shape, 2, True)
 
 
 class FitInputValidator(FitCommonValidator):
     def __init__(self, source_table, validation_table, output_model_table,
-                 model_arch_table, model_id, dependent_varname,
-                 independent_varname, dep_shape_cols, ind_shape_cols, num_iterations,
+                 model_arch_table, model_id, num_iterations,
                  metrics_compute_frequency, warm_start,
-                 use_gpus, accessible_gpus_for_seg, object_table, val_dep_var, val_ind_var):
+                 use_gpus, accessible_gpus_for_seg, object_table):
 
         self.module_name = 'madlib_keras_fit'
         super(FitInputValidator, self).__init__(source_table,
                                                 validation_table,
                                                 output_model_table,
-                                                model_arch_table,
-                                                model_id,
-                                                dependent_varname,
-                                                independent_varname,
-                                                dep_shape_cols,
-                                                ind_shape_cols,
                                                 num_iterations,
                                                 metrics_compute_frequency,
                                                 warm_start,
                                                 use_gpus,
                                                 accessible_gpus_for_seg,
                                                 self.module_name,
-                                                object_table,
-                                                val_dep_var,
-                                                val_ind_var)
-        InputValidator.validate_model_arch_table(self.module_name, self.model_arch_table,
-            self.model_id)
+                                                object_table
+                                                )
+        InputValidator.validate_model_arch_table(self.module_name, model_arch_table,
+            model_id)
 
 class FitMultipleInputValidator(FitCommonValidator):
     def __init__(self, source_table, validation_table, output_model_table,
-                 model_selection_table, model_selection_summary_table, dependent_varname,
-                 independent_varname, dep_shape_cols, ind_shape_cols,
-                 num_iterations, model_info_table, mst_key_col,
+                 model_selection_table, num_iterations, mst_key_col,
                  model_arch_table_col, metrics_compute_frequency, warm_start,
-                 use_gpus, accessible_gpus_for_seg, val_dep_var, val_ind_var):
+                 use_gpus, accessible_gpus_for_seg):
 
         self.module_name = 'madlib_keras_fit_multiple'
-
         input_tbl_valid(model_selection_table, self.module_name)
-        input_tbl_valid(model_selection_summary_table, self.module_name,
+        self.model_selection_summary_table = add_postfix(model_selection_table,
+                                                         '_summary')
+        input_tbl_valid(self.model_selection_summary_table, self.module_name,
                         error_suffix_str="Please ensure that the model selection table ({0}) "
                                          "has been created by "
                                          "load_model_selection_table().".format(
                                             model_selection_table))
         self.msts, self.model_arch_table, self.object_table = query_model_configs(
-            model_selection_table, model_selection_summary_table,
+            model_selection_table, self.model_selection_summary_table,
             mst_key_col, model_arch_table_col)
-        if warm_start:
-            input_tbl_valid(model_info_table, self.module_name)
-        else:
-            output_tbl_valid(model_info_table, self.module_name)
+        input_tbl_valid(self.model_arch_table, self.module_name)
         super(FitMultipleInputValidator, self).__init__(source_table,
                                                         validation_table,
                                                         output_model_table,
-                                                        self.model_arch_table,
-                                                        None,
-                                                        dependent_varname,
-                                                        independent_varname,
-                                                        dep_shape_cols,
-                                                        ind_shape_cols,
                                                         num_iterations,
                                                         metrics_compute_frequency,
                                                         warm_start,
@@ -477,6 +462,13 @@ class FitMultipleInputValidator(FitCommonValidator):
                                                         self.object_table,
                                                         val_dep_var,
                                                         val_ind_var)
+        self.output_model_info_table = add_postfix(output_model_table,
+                                                   '_info')
+
+        if warm_start:
+            input_tbl_valid(self.output_model_info_table, self.module_name)
+        else:
+            output_tbl_valid(self.output_model_info_table, self.module_name)
 
 class MstLoaderInputValidator():
     def __init__(self,
diff --git a/src/ports/postgres/modules/deep_learning/test/madlib_keras_evaluate.sql_in b/src/ports/postgres/modules/deep_learning/test/madlib_keras_evaluate.sql_in
index cdda44a..5eed811 100644
--- a/src/ports/postgres/modules/deep_learning/test/madlib_keras_evaluate.sql_in
+++ b/src/ports/postgres/modules/deep_learning/test/madlib_keras_evaluate.sql_in
@@ -61,6 +61,15 @@ SELECT assert(trap_error($TRAP$
     SELECT madlib_keras_evaluate('keras_saved_out', 'cifar_10_sample_val', 'evaluate_out', FALSE ,1);
     $TRAP$) = 1, 'Should error out if mst_key is given for non-multi model tables');
 
+DROP TABLE IF EXISTS evaluate_out;
+SELECT assert(test_input_table($test$SELECT madlib_keras_evaluate(
+    NULL, 'cifar_10_sample_val', 'evaluate_out', FALSE)$test$),
+    'Failed to assert the correct error message for null source table');
+
+SELECT assert(test_input_table($test$SELECT madlib_keras_evaluate(
+    'keras_saved_out', NULL, 'evaluate_out', FALSE)$test$),
+    'Failed to assert the correct error message for null source table');
+
 -- Test that evaluate errors out correctly if model_arch field missing from fit output
 DROP TABLE IF EXISTS evaluate_out;
 ALTER TABLE keras_saved_out DROP COLUMN model_arch;
diff --git a/src/ports/postgres/modules/deep_learning/test/madlib_keras_fit.sql_in b/src/ports/postgres/modules/deep_learning/test/madlib_keras_fit.sql_in
index 988d1f3..eaa6916 100644
--- a/src/ports/postgres/modules/deep_learning/test/madlib_keras_fit.sql_in
+++ b/src/ports/postgres/modules/deep_learning/test/madlib_keras_fit.sql_in
@@ -30,6 +30,48 @@
 )
 
 m4_include(`SQLCommon.m4')
+SELECT assert(test_output_table($test$SELECT madlib_keras_fit(
+    'cifar_10_sample_batched',
+    NULL,
+    'model_arch',
+    1,
+    $$ optimizer=SGD(lr=0.01, decay=1e-6, nesterov=True), loss='categorical_crossentropy', metrics=['mae']$$::text,
+    $$ batch_size=2, epochs=1, verbose=0 $$::text,
+    3)$test$), 'Failed to assert the correct error message for null output table');
+
+SELECT assert(test_input_table($test$SELECT madlib_keras_fit(
+    NULL,
+    'keras_saved_out',
+    'model_arch',
+    1,
+    $$ optimizer=SGD(lr=0.01, decay=1e-6, nesterov=True), loss='categorical_crossentropy', metrics=['mae']$$::text,
+    $$ batch_size=2, epochs=1, verbose=0 $$::text,
+    3,
+    NULL,
+    'cifar_10_sample_val')$test$), 'Failed to assert the correct error message for null source table');
+
+SELECT assert(test_input_table($test$SELECT madlib_keras_fit(
+    'cifar_10_sample_batched',
+    'keras_saved_out',
+    NULL,
+    1,
+    $$ optimizer=SGD(lr=0.01, decay=1e-6, nesterov=True), loss='categorical_crossentropy', metrics=['mae']$$::text,
+    $$ batch_size=2, epochs=1, verbose=0 $$::text,
+    3,
+    NULL,
+    'cifar_10_sample_val')$test$), 'Failed to assert the correct error message for null model arch table');
+
+SELECT assert(test_error_msg($test$SELECT madlib_keras_fit(
+    'cifar_10_sample_batched',
+    'keras_saved_out',
+    'model_arch',
+    1,
+    $$ optimizer=SGD(lr=0.01, decay=1e-6, nesterov=True), loss='categorical_crossentropy', metrics=['mae']$$::text,
+    $$ batch_size=2, epochs=1, verbose=0 $$::text,
+    3,
+    NULL,
+    'table_does_not_exist')$test$, $test$'table_does_not_exist' does not exist$test$
+    ), 'Failed to assert the correct error message for non existing validation table');
 
 -- Please do not break up the compile_params string
 -- It might break the assertion
diff --git a/src/ports/postgres/modules/deep_learning/test/madlib_keras_model_selection.sql_in b/src/ports/postgres/modules/deep_learning/test/madlib_keras_model_selection.sql_in
index 81554d3..2946184 100644
--- a/src/ports/postgres/modules/deep_learning/test/madlib_keras_model_selection.sql_in
+++ b/src/ports/postgres/modules/deep_learning/test/madlib_keras_model_selection.sql_in
@@ -342,6 +342,43 @@ SELECT load_model_selection_table(
         $$batch_size=32, epochs=1$$
     ]
 );
+----------- NULL input and output table validation
+DROP TABLE if exists iris_multiple_model, iris_multiple_model_summary, iris_multiple_model_info;
+SELECT assert(test_input_table($test$SELECT madlib_keras_fit_multiple_model(
+	NULL,
+	'iris_multiple_model',
+	'mst_table_4row',
+	1,
+	FALSE
+);$test$), 'Failed to assert the correct error message for null source table');
+
+DROP TABLE if exists iris_multiple_model, iris_multiple_model_summary, iris_multiple_model_info;
+SELECT assert(test_output_table($test$SELECT madlib_keras_fit_multiple_model(
+	'iris_data_packed',
+	NULL,
+	'mst_table_4row',
+	1,
+	FALSE
+);$test$), 'Failed to assert the correct error message for null output table');
+
+DROP TABLE if exists iris_multiple_model, iris_multiple_model_summary, iris_multiple_model_info;
+SELECT assert(test_input_table($test$SELECT madlib_keras_fit_multiple_model(
+	'iris_data_packed',
+	'iris_multiple_model',
+	NULL,
+	1,
+	FALSE
+);$test$), 'Failed to assert the correct error message for null mst table');
+
+DROP TABLE if exists iris_multiple_model, iris_multiple_model_summary, iris_multiple_model_info;
+SELECT assert(test_error_msg($test$SELECT madlib_keras_fit_multiple_model(
+	'iris_data_packed',
+	'iris_multiple_model',
+	'mst_table_4row',
+	1,
+    FALSE,
+	'table_does_not_exist'
+);$test$, $test$'table_does_not_exist' does not exist$test$), 'Failed to assert the correct error message for non existing validation table');
 
 -- Test for one-hot encoded input data
 CREATE OR REPLACE FUNCTION test_fit_multiple_one_hot_encoded_input(caching boolean)
diff --git a/src/ports/postgres/modules/deep_learning/test/madlib_keras_multi_io.sql_in b/src/ports/postgres/modules/deep_learning/test/madlib_keras_multi_io.sql_in
index 4afc47d..0c00851 100644
--- a/src/ports/postgres/modules/deep_learning/test/madlib_keras_multi_io.sql_in
+++ b/src/ports/postgres/modules/deep_learning/test/madlib_keras_multi_io.sql_in
@@ -119,3 +119,28 @@ SELECT madlib_keras_fit(
     'test_custom_function_table'
 );
 
+m4_changequote(`<!', `!>')
+m4_ifdef(<!__POSTGRESQL__!>, <!!>, <!
+-- Multiple models test
+DROP TABLE IF EXISTS mst_table_1row, mst_table_1row_summary;
+SELECT load_model_selection_table(
+    'iris_model_arch',
+    'mst_table_1row',
+    ARRAY[1],
+    ARRAY[
+        $$loss='categorical_crossentropy', optimizer='Adam(lr=0.01)', metrics=['accuracy']$$
+    ],
+    ARRAY[
+        $$batch_size=16, epochs=1$$
+    ]
+);
+DROP TABLE if exists iris_model, iris_model_summary, iris_model_info;
+SELECT assert(test_error_msg($test$SELECT madlib_keras_fit_multiple_model(
+	'iris_mult_packed',
+	'iris_model',
+	'mst_table_1row',
+	3,
+	FALSE)$test$, 'Multiple dependent and independent variables not supported'),
+	'Failed to assert the correct error message for multi-io not supported');
+!>)
+
diff --git a/src/ports/postgres/modules/deep_learning/test/madlib_keras_predict.sql_in b/src/ports/postgres/modules/deep_learning/test/madlib_keras_predict.sql_in
index 82db074..9994739 100644
--- a/src/ports/postgres/modules/deep_learning/test/madlib_keras_predict.sql_in
+++ b/src/ports/postgres/modules/deep_learning/test/madlib_keras_predict.sql_in
@@ -66,6 +66,26 @@ SELECT assert(class_value IN ('0','1'),
     'Predicted value not in set of defined class values for model')
 FROM cifar10_predict;
 
+-- Test for null source table and null output table
+DROP TABLE IF EXISTS cifar10_predict;
+SELECT assert(test_input_table($test$SELECT madlib_keras_predict(
+    NULL,
+    'cifar_10_sample',
+    'id',
+    'x',
+    'cifar10_predict',
+    NULL,
+    FALSE)$test$), 'Failed to assert the correct error message for null model table');
+
+SELECT assert(test_input_table($test$SELECT madlib_keras_predict(
+    'keras_saved_out',
+    NULL,
+    'id',
+    'x',
+    'cifar10_predict',
+    NULL,
+    FALSE)$test$), 'Failed to assert the correct error message for null test table');
+
 DROP TABLE IF EXISTS cifar10_predict;
 SELECT assert(trap_error($TRAP$SELECT madlib_keras_predict(
     'keras_saved_out',
diff --git a/src/ports/postgres/modules/deep_learning/test/madlib_keras_predict_byom.sql_in b/src/ports/postgres/modules/deep_learning/test/madlib_keras_predict_byom.sql_in
index 5fcee51..6f258cd 100644
--- a/src/ports/postgres/modules/deep_learning/test/madlib_keras_predict_byom.sql_in
+++ b/src/ports/postgres/modules/deep_learning/test/madlib_keras_predict_byom.sql_in
@@ -67,6 +67,33 @@ SELECT assert(
 FROM iris_predict AS p0,  iris_predict_byom AS p1
 WHERE p0.id=p1.id;
 
+DROP TABLE IF EXISTS cifar10_predict;
+SELECT assert(test_input_table($test$SELECT madlib_keras_predict_byom(
+     'iris_model_arch',
+     2,
+     NULL,
+     'id',
+     'attributes',
+     'iris_predict_byom',
+     'response',
+     NULL,
+     ARRAY[ARRAY['Iris-setosa', 'Iris-versicolor',
+      'Iris-virginica']::text[]]
+     )$test$), 'Failed to assert the correct error message for null test table');
+
+SELECT assert(test_input_table($test$SELECT madlib_keras_predict_byom(
+     NULL,
+     2,
+     'iris_test',
+     'id',
+     'attributes',
+     'iris_predict_byom',
+     'response',
+     NULL,
+     ARRAY[ARRAY['Iris-setosa', 'Iris-versicolor',
+      'Iris-virginica']::text[]]
+     )$test$), 'Failed to assert the correct error message for null model table');
+
 -- class_values NULL, pred_type is NULL (response)
 DROP TABLE IF EXISTS iris_predict_byom;
 SELECT madlib_keras_predict_byom(
diff --git a/src/ports/postgres/modules/deep_learning/test/unit_tests/test_madlib_keras.py_in b/src/ports/postgres/modules/deep_learning/test/unit_tests/test_madlib_keras.py_in
index bb40fba..928b753 100644
--- a/src/ports/postgres/modules/deep_learning/test/unit_tests/test_madlib_keras.py_in
+++ b/src/ports/postgres/modules/deep_learning/test/unit_tests/test_madlib_keras.py_in
@@ -867,6 +867,7 @@ class MadlibKerasPredictBYOMTestCase(unittest.TestCase):
         self.module.InputValidator.validate_predict_byom_tables = Mock()
         self.module.InputValidator.validate_input_shape = Mock()
         self.module.BasePredict.call_internal_keras = Mock()
+        self.module.input_tbl_valid = Mock()
 
     def tearDown(self):
         self.module_patcher.stop()
@@ -1278,9 +1279,11 @@ class MadlibKerasFitCommonValidatorTestCase(unittest.TestCase):
         self.module_patcher.start()
         import madlib_keras_validator
         self.subject = madlib_keras_validator
-        self.subject.FitCommonValidator._validate_common_args = Mock()
-        self.dep_shape_cols = [[10,1,1,1]]
-        self.ind_shape_cols = [[10,2]]
+        self.subject.FitCommonValidator._validate_tables = Mock()
+        self.subject.FitCommonValidator.get_source_summary_table_dict = \
+            Mock(return_value={'dependent_varname':['a'],
+                               'independent_varname':['b']})
+        self.subject.FitCommonValidator._validate_tables_schema = Mock()
 
     def tearDown(self):
         self.module_patcher.stop()
@@ -1288,34 +1291,26 @@ class MadlibKerasFitCommonValidatorTestCase(unittest.TestCase):
 
     def test_is_valid_metrics_compute_frequency_True_None(self):
         obj = self.subject.FitCommonValidator(
-            'test_table', 'val_table', 'model_table', 'model_arch_table', 2,
-            'dep_varname', 'independent_varname', self.dep_shape_cols,
-            self.ind_shape_cols, 5, None, False, False, [0],
-            'module_name', None, None, None)
+            'test_table', 'val_table', 'model_table', 5, None, False, False, [0],
+            'module_name', None)
         self.assertEqual(True, obj._is_valid_metrics_compute_frequency())
 
     def test_is_valid_metrics_compute_frequency_True_num(self):
         obj = self.subject.FitCommonValidator(
-            'test_table', 'val_table', 'model_table', 'model_arch_table', 2,
-            'dep_varname', 'independent_varname', self.dep_shape_cols,
-            self.ind_shape_cols, 5, 3, False, False, [0],
-            'module_name', None, None, None)
+            'test_table', 'val_table', 'model_table', 5, 3, False, False, [0],
+            'module_name', None)
         self.assertEqual(True, obj._is_valid_metrics_compute_frequency())
 
     def test_is_valid_metrics_compute_frequency_False_zero(self):
         obj = self.subject.FitCommonValidator(
-            'test_table', 'val_table', 'model_table', 'model_arch_table', 2,
-            'dep_varname', 'independent_varname', self.dep_shape_cols,
-            self.ind_shape_cols, 5, 0, False, False, [0],
-            'module_name', None, None, None)
+            'test_table', 'val_table', 'model_table', 5, 0, False, False, [0],
+            'module_name', None)
         self.assertEqual(False, obj._is_valid_metrics_compute_frequency())
 
     def test_is_valid_metrics_compute_frequency_False_greater(self):
         obj = self.subject.FitCommonValidator(
-            'test_table', 'val_table', 'model_table', 'model_arch_table', 2,
-            'dep_varname', 'independent_varname', self.dep_shape_cols,
-            self.ind_shape_cols, 5, 6, False, False, [0],
-            'module_name', None, None, None)
+            'test_table', 'val_table', 'model_table', 5, 6, False, False, [0],
+            'module_name', None)
         self.assertEqual(False, obj._is_valid_metrics_compute_frequency())
 
 
diff --git a/src/ports/postgres/modules/utilities/utilities.sql_in b/src/ports/postgres/modules/utilities/utilities.sql_in
index bbf861d..23abb40 100644
--- a/src/ports/postgres/modules/utilities/utilities.sql_in
+++ b/src/ports/postgres/modules/utilities/utilities.sql_in
@@ -542,6 +542,32 @@ BEGIN
 END;
 $$ LANGUAGE plpgsql;
 
+CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.test_error_msg(
+  stmt TEXT,
+  msg  TEXT
+)
+RETURNS BOOLEAN AS $$
+try:
+    plpy.execute(stmt)
+    return TRUE
+except Exception as ex:
+    return msg in ex.message
+$$ LANGUAGE plpythonu;
+
+CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.test_input_table(
+  stmt TEXT
+)
+RETURNS BOOLEAN AS $$
+SELECT MADLIB_SCHEMA.test_error_msg($1, 'NULL/empty input table name');
+$$ LANGUAGE SQL;
+
+CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.test_output_table(
+  stmt TEXT
+)
+RETURNS BOOLEAN AS $$
+SELECT MADLIB_SCHEMA.test_error_msg($1, 'NULL/empty output table name');
+$$ LANGUAGE SQL;
+
 -- A few of the gucs like plan_cache_mode and dev_opt_unsafe_truncate_in_subtransaction
 -- are only available in either > pg 11 or > gpdb 6.5. Using this function we
 -- can make sure to run the guc assertion test (assert_guc_value) on the correct

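Note on the test helpers added to utilities.sql_in above: the following is a minimal Python sketch of the kind of check test_error_msg performs, written so it can run outside the database. The names error_contains and fake_fit are hypothetical and are not part of MADlib. The sketch assumes the intended behavior is that the check fails when the statement raises no error at all, and it uses str(ex) instead of ex.message so that it also runs on Python 3.

```python
def error_contains(run, expected_msg):
    """Return True only if run() raises and the error text contains expected_msg.

    Assumption: a statement that succeeds should make the check fail,
    so the function returns False when no exception is raised.
    """
    try:
        run()
    except Exception as ex:
        return expected_msg in str(ex)
    return False


def fake_fit(source_table):
    """Hypothetical stand-in for a SQL call that validates its input table."""
    if not source_table:
        raise ValueError("madlib_keras_fit: NULL/empty input table name")
    return "ok"


# A NULL source table should trigger the expected message.
null_case = error_contains(lambda: fake_fit(None), "NULL/empty input table name")

# A valid call raises nothing, so the check reports failure.
valid_case = error_contains(lambda: fake_fit("cifar_10_sample_batched"),
                            "NULL/empty input table name")
```

In this sketch, test_input_table and test_output_table would correspond to calling error_contains with a fixed expected message, which is how the SQL wrappers in the diff delegate to test_error_msg.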
[madlib] 04/04: DL: Cleanup fit and fit_multiple

Posted by nk...@apache.org.

nkak pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/madlib.git

commit a0f711cc50762ec30f1b5f5f5d435286f8e94248
Author: Nikhil Kak <nk...@vmware.com>
AuthorDate: Fri Jan 29 18:29:31 2021 -0800

    DL: Cleanup fit and fit_multiple
    
    JIRA: MADLIB-1464
    
    Previously we were creating a columns_dict variable which contained the
    output of the packed summary table. This led to code being slightly
    harder to maintain and also some duplication.
    
    This commit removes this variable and the code now relies directly on the
    output of the summary table.
    
    Also renamed a few variables for consistency.
    
    Co-authored-by: Ekta Khanna <ek...@vmware.com>
---
 .../modules/deep_learning/madlib_keras.py_in       | 127 +++++++++------------
 .../madlib_keras_fit_multiple_model.py_in          |  47 ++++----
 .../deep_learning/madlib_keras_validator.py_in     |  55 ++++-----
 .../deep_learning/test/madlib_keras_fit.sql_in     |  24 ++++
 .../test/unit_tests/test_madlib_keras.py_in        |  12 +-
 5 files changed, 131 insertions(+), 134 deletions(-)

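Note on the cleanup described in the commit message: a rough sketch of deriving the per-column names directly from the summary row instead of copying them into a separate columns_dict. The helper add_postfix_sketch is a simplified, hypothetical stand-in for MADlib's add_postfix (it assumes plain, unquoted column names), and derived_column_names is an illustrative name only. The "_shape" and "_class_values" suffixes are taken from the lines in the diff below.

```python
def add_postfix_sketch(name, postfix):
    """Simplified stand-in for MADlib's add_postfix.

    Assumption: column names are plain, unquoted identifiers.
    """
    return name + postfix


def derived_column_names(src_summary_dict):
    """Build shape and class-values column names from the summary row."""
    dep = src_summary_dict['dependent_varname']
    indep = src_summary_dict['independent_varname']
    dep_shape = [add_postfix_sketch(c, '_shape') for c in dep]
    indep_shape = [add_postfix_sketch(c, '_shape') for c in indep]
    class_values = [add_postfix_sketch(c, '_class_values') for c in dep]
    return dep_shape, indep_shape, class_values


# Hypothetical summary row with one dependent and one independent column.
summary_row = {'dependent_varname': ['y'], 'independent_varname': ['x']}
dep_shape, indep_shape, class_values = derived_column_names(summary_row)
```

Keeping the summary row as the single source of the variable names means callers such as compute_loss_and_metrics can take the name lists as plain arguments, which is the direction the diff below takes.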
diff --git a/src/ports/postgres/modules/deep_learning/madlib_keras.py_in b/src/ports/postgres/modules/deep_learning/madlib_keras.py_in
index c4f8611..67f2a56 100644
--- a/src/ports/postgres/modules/deep_learning/madlib_keras.py_in
+++ b/src/ports/postgres/modules/deep_learning/madlib_keras.py_in
@@ -120,27 +120,10 @@ def fit(schema_madlib, source_table, model, model_arch_table,
         num_iterations, metrics_compute_frequency, warm_start,
         use_gpus, accessible_gpus_for_seg, object_table)
 
-    columns_dict = {}
-    columns_dict['mb_dep_var_cols'] = fit_validator.dependent_varname
-    columns_dict['mb_indep_var_cols'] = fit_validator.independent_varname
-    columns_dict['dep_shape_cols'] = fit_validator.dep_shape_cols
-    columns_dict['ind_shape_cols'] = fit_validator.ind_shape_cols
-    columns_dict['val_dep_var'] = fit_validator.val_dep_var
-    columns_dict['val_ind_var'] = fit_validator.val_ind_var
-    columns_dict['val_dep_shape_cols'] = fit_validator.val_dep_shape_cols
-    columns_dict['val_ind_shape_cols'] = fit_validator.val_ind_shape_cols
     multi_dep_count = len(fit_validator.dependent_varname)
-
-    # fit_validator.dependent_varname = columns_dict['mb_dep_var_cols']
-    # fit_validator.independent_varname = columns_dict['mb_indep_var_cols']
-    # fit_validator.dep_shape_col = columns_dict['dep_shape_cols']
-    # fit_validator.ind_shape_col = columns_dict['ind_shape_cols']
     src_summary_dict = fit_validator.src_summary_dict
-    class_values_colnames = [add_postfix(i, "_class_values") for i in columns_dict['mb_dep_var_cols']]
-    src_summary_dict['class_values_type'] =[ get_expr_type(
-        i, fit_validator.source_summary_table) for i in class_values_colnames]
-    src_summary_dict['norm_const_type'] = get_expr_type(
-        NORMALIZING_CONST_COLNAME, fit_validator.source_summary_table)
+    class_values_colnames = [add_postfix(i, "_class_values") for i in
+                             fit_validator.dependent_varname]
 
     if metrics_compute_frequency is None:
         metrics_compute_frequency = num_iterations
@@ -172,10 +155,16 @@ def fit(schema_madlib, source_table, model, model_arch_table,
     serialized_weights = get_initial_weights(model, model_arch, model_weights,
                                              warm_start, accessible_gpus_for_seg)
     # Compute total images on each segment
-    dist_key_mapping, images_per_seg_train = get_image_count_per_seg_for_minibatched_data_from_db(source_table, columns_dict['dep_shape_cols'][0])
+    shape_col = fit_validator.dependent_shape_varname[0]
+    dist_key_mapping, images_per_seg_train = \
+        get_image_count_per_seg_for_minibatched_data_from_db(source_table,
+                                                             shape_col)
 
     if validation_table:
-        dist_key_mapping_val, images_per_seg_val = get_image_count_per_seg_for_minibatched_data_from_db(validation_table, columns_dict['dep_shape_cols'][0])
+        shape_col = fit_validator.val_dependent_shape_varname[0]
+        dist_key_mapping_val, images_per_seg_val = \
+            get_image_count_per_seg_for_minibatched_data_from_db(validation_table,
+                                                                 shape_col)
 
     # Construct validation dataset if provided
     validation_set_provided = bool(validation_table)
@@ -198,31 +187,31 @@ def fit(schema_madlib, source_table, model, model_arch_table,
         plpy.error("Object table not specified for function {0} in compile_params".format(custom_fn_list))
 
     # Use the smart interface
-    if (len(columns_dict['mb_dep_var_cols']) <= 5 and
-        len(columns_dict['mb_indep_var_cols']) <= 5):
+    if (len(fit_validator.dependent_varname) <= 5 and
+        len(fit_validator.independent_varname) <= 5):
 
         dep_var_array = 5 * ["NULL"]
         indep_var_array = 5 * ["NULL"]
 
-        for counter, var in enumerate(columns_dict['mb_dep_var_cols']):
+        for counter, var in enumerate(fit_validator.dependent_varname):
             dep_var_array[counter] = var
 
-        for counter, var in enumerate(columns_dict['mb_indep_var_cols']):
+        for counter, var in enumerate(fit_validator.independent_varname):
             indep_var_array[counter] = var
         mb_dep_var_cols_sql = ', '.join(dep_var_array)
         mb_indep_var_cols_sql = ', '.join(indep_var_array)
     else:
 
         mb_dep_var_cols_sql = ', '.join(["dependent_var_{0}".format(i)
-                                    for i in columns_dict['mb_dep_var_cols']])
+                                    for i in fit_validator.dependent_varname])
         mb_dep_var_cols_sql = "ARRAY[{0}]".format(mb_dep_var_cols_sql)
 
         mb_indep_var_cols_sql = ', '.join(["independent_var_{0}".format(i)
-                                    for i in columns_dict['mb_indep_var_cols']])
+                                    for i in fit_validator.independent_varname])
         mb_indep_var_cols_sql = "ARRAY[{0}]".format(mb_indep_var_cols_sql)
 
-    dep_shape_cols_sql = ', '.join(columns_dict['dep_shape_cols'])
-    ind_shape_cols_sql = ', '.join(columns_dict['ind_shape_cols'])
+    dep_shape_cols_sql = ', '.join(fit_validator.dependent_shape_varname)
+    ind_shape_cols_sql = ', '.join(fit_validator.independent_shape_varname)
 
     run_training_iteration = plpy.prepare("""
         SELECT {schema_madlib}.fit_step(
@@ -295,7 +284,8 @@ def fit(schema_madlib, source_table, model, model_arch_table,
                 should_clear_session = is_final_iteration
 
             compute_out = compute_loss_and_metrics(schema_madlib, source_table,
-                                                   columns_dict,
+                                                   fit_validator.dependent_varname,
+                                                   fit_validator.independent_varname,
                                                    compile_params_to_pass,
                                                    model_arch,
                                                    serialized_weights, use_gpus,
@@ -314,7 +304,8 @@ def fit(schema_madlib, source_table, model, model_arch_table,
                 # Compute loss/accuracy for validation data.
                 val_compute_out = compute_loss_and_metrics(schema_madlib,
                                                            validation_table,
-                                                           columns_dict,
+                                                           fit_validator.val_dependent_varname,
+                                                           fit_validator.val_independent_varname,
                                                            compile_params_to_pass,
                                                            model_arch,
                                                            serialized_weights,
@@ -337,9 +328,7 @@ def fit(schema_madlib, source_table, model, model_arch_table,
     end_training_time = datetime.datetime.now()
 
     version = madlib_version(schema_madlib)
-    class_values_type = src_summary_dict['class_values_type']
     norm_const = src_summary_dict['normalizing_const']
-    norm_const_type = src_summary_dict['norm_const_type']
     dep_vartype = src_summary_dict['dependent_vartype']
     dependent_varname = src_summary_dict['dependent_varname']
     independent_varname = src_summary_dict['independent_varname']
@@ -504,33 +493,32 @@ def get_source_summary_table_dict(source_summary_table):
 
     return source_summary
 
-def compute_loss_and_metrics(schema_madlib, table, columns_dict, compile_params,
+def compute_loss_and_metrics(schema_madlib, table, dependent_varname,
+                             independent_varname, compile_params,
                              model_arch, serialized_weights, use_gpus,
                              accessible_gpus_for_seg, segments_per_host,
                              dist_key_mapping, images_per_seg_val, metrics_list,
                              loss_list, should_clear_session, custom_fn_map,
-                             model_table=None, mst_key=None, is_train=True):
+                             model_table=None, mst_key=None):
     """
     Compute the loss and metric using a given model (serialized_weights) on the
     given dataset (table.)
     """
     start_val = time.time()
-    evaluate_result = get_loss_metric_from_keras_eval(schema_madlib,
-                                                   table,
-                                                   columns_dict,
-                                                   compile_params,
-                                                   model_arch,
-                                                   serialized_weights,
-                                                   use_gpus,
-                                                   accessible_gpus_for_seg,
-                                                   segments_per_host,
-                                                   dist_key_mapping,
-                                                   images_per_seg_val,
-                                                   should_clear_session,
-                                                   custom_fn_map,
-                                                   model_table,
-                                                   mst_key,
-                                                   is_train)
+    evaluate_result = get_loss_metric_from_keras_eval(schema_madlib, table,
+                                                      dependent_varname,
+                                                      independent_varname,
+                                                      compile_params,
+                                                      model_arch,
+                                                      serialized_weights,
+                                                      use_gpus,
+                                                      accessible_gpus_for_seg,
+                                                      segments_per_host,
+                                                      dist_key_mapping,
+                                                      images_per_seg_val,
+                                                      should_clear_session,
+                                                      custom_fn_map, model_table,
+                                                      mst_key)
     end_val = time.time()
     loss = evaluate_result[0]
     metric = evaluate_result[1]
@@ -882,14 +870,11 @@ def evaluate(schema_madlib, model_table, test_table, output_table,
     # independent_varname = model_summary_dict['independent_varname']
     # ind_shape_cols = [add_postfix(i, "_shape") for i in independent_varname]
 
-    columns_dict = {}
-    columns_dict['mb_dep_var_cols'] = model_summary_dict['dependent_varname']
-    columns_dict['mb_indep_var_cols'] = model_summary_dict['independent_varname']
-    columns_dict['dep_shape_cols'] = [add_postfix(i, "_shape") for i in columns_dict['mb_dep_var_cols']]
-    columns_dict['ind_shape_cols'] = [add_postfix(i, "_shape") for i in columns_dict['mb_indep_var_cols']]
+    dep_varname = model_summary_dict['dependent_varname']
+    indep_varname = model_summary_dict['independent_varname']
 
     InputValidator.validate_input_shape(
-        test_table, columns_dict['mb_indep_var_cols'], input_shape, 2, True)
+        test_table, indep_varname, input_shape, 2, True)
 
     compile_params_query = "SELECT compile_params, metrics_type, object_table FROM {0}".format(model_summary_table)
     res = plpy.execute(compile_params_query)[0]
@@ -902,11 +887,13 @@ def evaluate(schema_madlib, model_table, test_table, output_table,
         custom_fn_list = get_custom_functions_list(res['compile_params'])
         custom_function_map = query_custom_functions_map(object_table, custom_fn_list)
 
-    dist_key_mapping, images_per_seg = get_image_count_per_seg_for_minibatched_data_from_db(test_table, columns_dict['ind_shape_cols'][0])
+    shape_col = add_postfix(dep_varname[0], "_shape")
+    dist_key_mapping, images_per_seg = \
+        get_image_count_per_seg_for_minibatched_data_from_db(test_table, shape_col)
 
     loss_metric = \
         get_loss_metric_from_keras_eval(
-            schema_madlib, test_table, columns_dict, compile_params, model_arch,
+            schema_madlib, test_table, dep_varname, indep_varname, compile_params, model_arch,
             model_weights, use_gpus, accessible_gpus_for_seg, segments_per_host,
             dist_key_mapping, images_per_seg, custom_function_map=custom_function_map)
 
@@ -951,12 +938,13 @@ def validate_evaluate(module_name, model_table, model_summary_table, test_table,
     for i in dependent_varname:
         validate_bytea_var_for_minibatch(test_table, i)
 
-def get_loss_metric_from_keras_eval(schema_madlib, table, columns_dict, compile_params,
+def get_loss_metric_from_keras_eval(schema_madlib, table, dependent_varname,
+                                    independent_varname, compile_params,
                                     model_arch, serialized_weights, use_gpus,
                                     accessible_gpus_for_seg, segments_per_host,
                                     dist_key_mapping, images_per_seg,
                                     should_clear_session=True, custom_function_map=None,
-                                    model_table=None, mst_key=None, is_train=True):
+                                    model_table=None, mst_key=None):
     """
     This function will call the internal keras evaluate function to get the loss
     and accuracy of each tuple which then gets averaged to get the final result.
@@ -971,17 +959,12 @@ def get_loss_metric_from_keras_eval(schema_madlib, table, columns_dict, compile_
     """
     use_gpus = use_gpus if use_gpus else False
 
-    if is_train:
-        mb_dep_var_cols_sql = ', '.join(columns_dict['mb_dep_var_cols'])
-        mb_indep_var_cols_sql = ', '.join(columns_dict['mb_indep_var_cols'])
-        dep_shape_cols_sql = ', '.join(columns_dict['dep_shape_cols'])
-        ind_shape_cols_sql = ', '.join(columns_dict['ind_shape_cols'])
-    else:
-        mb_dep_var_cols_sql = ', '.join(columns_dict['val_dep_var'])
-        mb_indep_var_cols_sql = ', '.join(columns_dict['val_ind_var'])
-        dep_shape_cols_sql = ', '.join(columns_dict['val_dep_shape_cols'])
-        ind_shape_cols_sql = ', '.join(columns_dict['val_ind_shape_cols'])
-
+    mb_dep_var_cols_sql = ', '.join(dependent_varname)
+    mb_indep_var_cols_sql = ', '.join(independent_varname)
+    dep_shape_cols = [add_postfix(i, "_shape") for i in dependent_varname]
+    ind_shape_cols = [add_postfix(i, "_shape") for i in independent_varname]
+    dep_shape_cols_sql = ', '.join(dep_shape_cols)
+    ind_shape_cols_sql = ', '.join(ind_shape_cols)
 
     eval_sql = """
         select ({schema_madlib}.internal_keras_evaluate(
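The refactored get_loss_metric_from_keras_eval in the hunk above builds its column lists directly from the variable names it is given, rather than reading them from a columns_dict keyed by is_train. A minimal sketch of that derivation follows. The add_postfix shown here is a simplified stand-in for the MADlib utility of the same name (the real one may treat quoted identifiers differently), and build_eval_column_sql and the column names are hypothetical.

```python
def add_postfix(name, postfix):
    """Simplified stand-in for MADlib's add_postfix: plain concatenation only."""
    return name + postfix


def build_eval_column_sql(dependent_varname, independent_varname):
    """Return the four comma-separated column lists used by the eval query."""
    dep_shape_cols = [add_postfix(i, "_shape") for i in dependent_varname]
    ind_shape_cols = [add_postfix(i, "_shape") for i in independent_varname]
    return {
        "mb_dep_var_cols_sql": ", ".join(dependent_varname),
        "mb_indep_var_cols_sql": ", ".join(independent_varname),
        "dep_shape_cols_sql": ", ".join(dep_shape_cols),
        "ind_shape_cols_sql": ", ".join(ind_shape_cols),
    }


# Hypothetical validation table whose columns differ from the training table.
cols = build_eval_column_sql(["val_class_text"], ["val_attributes"])
```

Because the caller now passes either the training or the validation variable names, the function itself no longer needs to know which table it is evaluating.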
diff --git a/src/ports/postgres/modules/deep_learning/madlib_keras_fit_multiple_model.py_in b/src/ports/postgres/modules/deep_learning/madlib_keras_fit_multiple_model.py_in
index 22b9401..aa7a2bc 100644
--- a/src/ports/postgres/modules/deep_learning/madlib_keras_fit_multiple_model.py_in
+++ b/src/ports/postgres/modules/deep_learning/madlib_keras_fit_multiple_model.py_in
@@ -168,15 +168,6 @@ class FitMultipleModel(object):
         self.msts = self.fit_validator_train.msts
         self.model_arch_table = self.fit_validator_train.model_arch_table
         self.object_table = self.fit_validator_train.object_table
-        self.columns_dict = {}
-        self.columns_dict['mb_dep_var_cols'] = self.fit_validator_train.dependent_varname
-        self.columns_dict['mb_indep_var_cols'] = self.fit_validator_train.independent_varname
-        self.columns_dict['dep_shape_cols'] = self.fit_validator_train.dep_shape_cols
-        self.columns_dict['ind_shape_cols'] = self.fit_validator_train.ind_shape_cols
-        self.columns_dict['val_dep_var'] = self.fit_validator_train.val_dep_var
-        self.columns_dict['val_ind_var'] = self.fit_validator_train.val_ind_var
-        self.columns_dict['val_dep_shape_cols'] = self.fit_validator_train.val_dep_shape_cols
-        self.columns_dict['val_ind_shape_cols'] = self.fit_validator_train.val_ind_shape_cols
 
         self.metrics_iters = []
         self.object_map_col = 'object_map'
@@ -188,17 +179,19 @@ class FitMultipleModel(object):
         if CUDA_VISIBLE_DEVICES_KEY in os.environ:
             self.original_cuda_env = os.environ[CUDA_VISIBLE_DEVICES_KEY]
 
+        shape_col = self.fit_validator_train.dependent_shape_varname[0]
         self.dist_key_mapping, self.images_per_seg_train = \
             get_image_count_per_seg_for_minibatched_data_from_db(
-                self.source_table, self.fit_validator_train.dep_shape_cols[0])
+                self.source_table, shape_col)
 
         if self.validation_table:
+            shape_col = self.fit_validator_train.val_dependent_shape_varname[0]
             self.valid_mst_metric_eval_time = defaultdict(list)
             self.valid_mst_loss = defaultdict(list)
             self.valid_mst_metric = defaultdict(list)
             self.dist_key_mapping_valid, self.images_per_seg_valid = \
                 get_image_count_per_seg_for_minibatched_data_from_db(
-                    self.validation_table, self.fit_validator_train.val_dep_shape_cols[0])
+                    self.validation_table, shape_col)
 
         self.dist_keys = query_dist_keys(self.source_table, self.dist_key_col)
         self.max_dist_key = sorted(self.dist_keys)[-1]
@@ -312,16 +305,17 @@ class FitMultipleModel(object):
     def evaluate_model(self, iter, table, is_train):
         if is_train:
             label = "training"
-        else:
-            label = "validation"
-
-        if is_train:
+            dependent_varname = self.fit_validator_train.dependent_varname
+            independent_varname = self.fit_validator_train.independent_varname
             mst_metric_eval_time = self.train_mst_metric_eval_time
             mst_loss = self.train_mst_loss
             mst_metric = self.train_mst_metric
             seg_ids = self.dist_key_mapping
             images_per_seg = self.images_per_seg_train
         else:
+            label = "validation"
+            dependent_varname = self.fit_validator_train.val_dependent_varname
+            independent_varname = self.fit_validator_train.val_independent_varname
             mst_metric_eval_time = self.valid_mst_metric_eval_time
             mst_loss = self.valid_mst_loss
             mst_metric = self.valid_mst_metric
@@ -333,21 +327,20 @@ class FitMultipleModel(object):
             model_arch = get_model_arch(self.model_arch_table, mst[self.model_id_col])
             DEBUG.start_timing('eval_compute_loss_and_metrics')
             eval_compute_time, metric, loss = compute_loss_and_metrics(
-                self.schema_madlib, table, self.columns_dict,
-                    "$madlib${0}$madlib$".format(
+                self.schema_madlib, table, dependent_varname, independent_varname,
+                "$madlib${0}$madlib$".format(
                     mst[self.compile_params_col]),
-                    model_arch,
-                    None,
-                    self.use_gpus,
-                    self.accessible_gpus_for_seg,
-                    self.segments_per_host,
+                model_arch,
+                None,
+                self.use_gpus,
+                self.accessible_gpus_for_seg,
+                self.segments_per_host,
                 seg_ids,
                 images_per_seg,
                 [], [], True,
                 mst[self.object_map_col],
                 self.model_output_tbl,
-                mst[self.mst_key_col],
-                    is_train)
+                mst[self.mst_key_col])
             total_eval_compute_time += eval_compute_time
             mst_metric_eval_time[mst[self.mst_key_col]] \
                 .append(self.metrics_elapsed_time_offset + (time.time() - self.metrics_elapsed_start_time))
@@ -683,7 +676,7 @@ class FitMultipleModel(object):
 
         class_values_colnames = [add_postfix(i, "_class_values") for i in self.fit_validator_train.dependent_varname]
         # class_values = src_summary_dict['class_values']
-        class_values_type =[get_expr_type(i, source_summary_table) for i in class_values_colnames]
+        # class_values_type =[get_expr_type(i, source_summary_table) for i in class_values_colnames]
         # class_values_type = src_summary_dict['class_values_type']
 
         dependent_varname = src_summary_dict['dependent_varname']
@@ -865,8 +858,8 @@ class FitMultipleModel(object):
             """.format(self=self))
 
         #TODO: Fix these to add multi io
-        dep_shape_col = self.fit_validator_train.dep_shape_cols[0]
-        ind_shape_col = self.fit_validator_train.ind_shape_cols[0]
+        dep_shape_col = self.fit_validator_train.dependent_shape_varname[0]
+        ind_shape_col = self.fit_validator_train.independent_shape_varname[0]
         dep_var_col = self.fit_validator_train.dependent_varname[0]
         indep_var_col = self.fit_validator_train.independent_varname[0]
         source_table = self.source_table
diff --git a/src/ports/postgres/modules/deep_learning/madlib_keras_validator.py_in b/src/ports/postgres/modules/deep_learning/madlib_keras_validator.py_in
index 439d9d9..535d70d 100644
--- a/src/ports/postgres/modules/deep_learning/madlib_keras_validator.py_in
+++ b/src/ports/postgres/modules/deep_learning/madlib_keras_validator.py_in
@@ -291,23 +291,24 @@ class FitCommonValidator(object):
         self.independent_varname = self.src_summary_dict['independent_varname']
         if not isinstance(self.dependent_varname, list) or \
                 not isinstance(self.independent_varname, list):
-            #TODO improve error message
-            plpy.error("Input table '{0}' has not been preprocessed properly. "
-                       "Please run input preprocessor again.".format(self.source_table))
-        self.dep_shape_cols = [add_postfix(i, "_shape") for i in self.dependent_varname]
-        self.ind_shape_cols = [add_postfix(i, "_shape") for i in self.independent_varname]
-
-        self.val_dep_var = None
-        self.val_ind_var = None
-        self.val_dep_shape_cols = None
-        self.val_ind_shape_cols = None
+            plpy.error("Input table '{0}' was preprocessed with "\
+                       "an older version of the input preprocessor. "
+                       "Please re-run the current version of input preprocessor "\
+                       "on the dataset.".format(self.source_table))
+        self.dependent_shape_varname = [add_postfix(i, "_shape") for i in self.dependent_varname]
+        self.independent_shape_varname = [add_postfix(i, "_shape") for i in self.independent_varname]
+
+        self.val_dependent_varname = None
+        self.val_independent_varname = None
+        self.val_dependent_shape_varname = None
+        self.val_independent_shape_varname = None
         if self.validation_table:
             val_summary_dict = self.get_source_summary_table_dict(self.validation_summary_table)
 
-            self.val_dep_var = val_summary_dict['dependent_varname']
-            self.val_ind_var = val_summary_dict['independent_varname']
-            self.val_dep_shape_cols = [add_postfix(i, "_shape") for i in self.val_dep_var]
-            self.val_ind_shape_cols = [add_postfix(i, "_shape") for i in self.val_ind_var]
+            self.val_dependent_varname = val_summary_dict['dependent_varname']
+            self.val_independent_varname = val_summary_dict['independent_varname']
+            self.val_dependent_shape_varname = [add_postfix(i, "_shape") for i in self.val_dependent_varname]
+            self.val_independent_shape_varname = [add_postfix(i, "_shape") for i in self.val_independent_varname]
 
         self._validate_tables_schema()
         if use_gpus:
@@ -340,22 +341,22 @@ class FitCommonValidator(object):
             additional_cols.append(DISTRIBUTION_KEY_COLNAME)
 
         self._validate_columns_in_preprocessed_table(self.source_table,
-                                                    self.independent_varname +
-                                                    self.dependent_varname +
-                                                    self.ind_shape_cols +
-                                                    self.dep_shape_cols +
-                                                    additional_cols)
+                                                     self.independent_varname +
+                                                     self.dependent_varname +
+                                                     self.independent_shape_varname +
+                                                     self.dependent_shape_varname +
+                                                     additional_cols)
         for i in self.dependent_varname:
             validate_bytea_var_for_minibatch(self.source_table, i)
 
         if self.validation_table and self.validation_table.strip() != '':
             self._validate_columns_in_preprocessed_table(self.validation_table,
-                                                        self.val_ind_var +
-                                                        self.val_dep_var +
-                                                        self.val_ind_shape_cols +
-                                                        self.val_dep_shape_cols+
-                                                        additional_cols)
-            for i in self.val_dep_var:
+                                                         self.val_independent_varname +
+                                                         self.val_dependent_varname +
+                                                         self.val_independent_shape_varname +
+                                                         self.val_dependent_shape_varname +
+                                                         additional_cols)
+            for i in self.val_dependent_varname:
                 validate_bytea_var_for_minibatch(self.validation_table, i)
 
         cols_in_tbl_valid(self.source_summary_table,
@@ -397,7 +398,7 @@ class FitCommonValidator(object):
             self._validate_input_table(self.validation_table, True)
             validation_summary_table = add_postfix(self.validation_table, "_summary")
             input_tbl_valid(validation_summary_table, self.module_name)
-            for i in self.val_dep_var:
+            for i in self.val_dependent_varname:
                 dependent_vartype = get_expr_type(i,
                                                   self.validation_table)
                 _assert(dependent_vartype == 'bytea',
@@ -411,7 +412,7 @@ class FitCommonValidator(object):
                                input_shape, 2, True)
         if self.validation_table:
             InputValidator.validate_input_shape(
-                self.validation_table,  self.val_ind_var,
+                self.validation_table,  self.val_independent_varname,
                 input_shape, 2, True)
 
 
diff --git a/src/ports/postgres/modules/deep_learning/test/madlib_keras_fit.sql_in b/src/ports/postgres/modules/deep_learning/test/madlib_keras_fit.sql_in
index eaa6916..74aff3c 100644
--- a/src/ports/postgres/modules/deep_learning/test/madlib_keras_fit.sql_in
+++ b/src/ports/postgres/modules/deep_learning/test/madlib_keras_fit.sql_in
@@ -514,3 +514,27 @@ SELECT madlib_keras_fit(
 	FALSE
 );
 SELECT assert(sum(get_gd_keys_len()) = 0, 'GD was not cleared properly!') m4_ifdef(<!__POSTGRESQL__!>, <!!>, <! FROM gp_dist_random('gp_id') !>);
+
+--- Test when source table and validation table have different column names
+DROP TABLE IF EXISTS iris_data_2;
+CREATE TABLE iris_data_2 as SELECT id, attributes as val_attributes, class_text as val_class_text FROM iris_data;
+DROP TABLE IF EXISTS iris_data_val_packed_2, iris_data_val_packed_2_summary;
+SELECT validation_preprocessor_dl('iris_data_2',    -- Source table
+                                'iris_data_val_packed_2',  -- Output table
+                                'val_class_text',     -- Dependent variable
+                                'val_attributes',     -- Independent variable
+                                'iris_data_packed'    -- Training preprocessed table
+                                );
+
+DROP TABLE if exists iris_model, iris_model_summary;
+SELECT madlib_keras_fit(
+	'iris_data_packed',
+	'iris_model',
+	'iris_model_arch',
+	1,
+	$$loss='categorical_crossentropy', optimizer='Adam(lr=0.01)', metrics=['accuracy']$$,
+  $$batch_size=16, epochs=1$$,
+	3,
+	FALSE,
+    'iris_data_val_packed_2'
+);
diff --git a/src/ports/postgres/modules/deep_learning/test/unit_tests/test_madlib_keras.py_in b/src/ports/postgres/modules/deep_learning/test/unit_tests/test_madlib_keras.py_in
index 5ef4517..164d743 100644
--- a/src/ports/postgres/modules/deep_learning/test/unit_tests/test_madlib_keras.py_in
+++ b/src/ports/postgres/modules/deep_learning/test/unit_tests/test_madlib_keras.py_in
@@ -906,7 +906,6 @@ class MadlibKerasPredictBYOMTestCase(unittest.TestCase):
                                      self.dependent_count)
         self.assertIn('invalid_pred_type', str(error.exception))
 
-        # The validation for this test has been disabled
         with self.assertRaises(plpy.PLPYException) as error:
             self.module.PredictBYOM('schema_madlib', 'model_arch_table',
                                      'model_id', 'test_table', 'id_col',
@@ -1314,36 +1313,33 @@ class MadlibKerasFitCommonValidatorTestCase(unittest.TestCase):
         self.assertEqual(False, obj._is_valid_metrics_compute_frequency())
 
     def test_validator_dep_indep_type_not_array(self):
+        expected_error_regex = "test_table.*preprocessed.*older version.*input preprocessor.*"
         # only dep is not array
         self.subject.FitCommonValidator.get_source_summary_table_dict = \
             Mock(return_value={'dependent_varname':'a',
                                'independent_varname':['b']})
-        with self.assertRaises(plpy.PLPYException) as error:
+        with self.assertRaisesRegexp(plpy.PLPYException, expected_error_regex):
             self.subject.FitCommonValidator(
                 'test_table', 'val_table', 'model_table', 5, None, False, False, [0],
                 'module_name', None)
-        self.assertIn('not been preprocessed properly', str(error.exception))
 
         # only indep is not array
         self.subject.FitCommonValidator.get_source_summary_table_dict = \
             Mock(return_value={'dependent_varname':['a'],
                                'independent_varname':'b'})
-        with self.assertRaises(plpy.PLPYException) as error:
+        with self.assertRaisesRegexp(plpy.PLPYException, expected_error_regex):
             self.subject.FitCommonValidator(
                 'test_table', 'val_table', 'model_table', 5, None, False, False, [0],
                 'module_name', None)
-        self.assertIn('not been preprocessed properly', str(error.exception))
 
         # both indep and dep are not arrays
         self.subject.FitCommonValidator.get_source_summary_table_dict = \
             Mock(return_value={'dependent_varname':'a',
                                'independent_varname':'b'})
-        with self.assertRaises(plpy.PLPYException) as error:
+        with self.assertRaisesRegexp(plpy.PLPYException, expected_error_regex):
             self.subject.FitCommonValidator(
                 'test_table', 'val_table', 'model_table', 5, None, False, False, [0],
                 'module_name', None)
-        self.assertIn('not been preprocessed properly', str(error.exception))
-
 
 class InputValidatorTestCase(unittest.TestCase):
     def setUp(self):
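The updated unit test above replaces the substring check with a regular expression. The sketch below rebuilds both error messages from the string literals in the validator diff and checks them against that regex with re.search, which is how unittest's regex assertions match. It assumes the exception's string form is the message itself, which may not hold exactly for the mocked plpy module.

```python
import re

# Regex copied from test_validator_dep_indep_type_not_array.
expected_error_regex = "test_table.*preprocessed.*older version.*input preprocessor.*"

# Message rebuilt from the string literals added to FitCommonValidator.
new_message = ("Input table '{0}' was preprocessed with "
               "an older version of the input preprocessor. "
               "Please re-run the current version of input preprocessor "
               "on the dataset.".format("test_table"))

# Message that was in place before this commit.
old_message = ("Input table '{0}' has not been preprocessed properly. "
               "Please run input preprocessor again.".format("test_table"))

new_matches = re.search(expected_error_regex, new_message) is not None
old_matches = re.search(expected_error_regex, old_message) is not None
```

Only the new wording matches, so the test would fail if the old message were restored. Note that assertRaisesRegexp is the Python 2 spelling; Python 3 names the method assertRaisesRegex.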

[madlib] 03/04: DL: Fix misc bugs

Posted by nk...@apache.org.

nkak pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/madlib.git

commit fe42e7f5ec2fe1c1d5cc069dd01929c4131ac4d8
Author: Ekta Khanna <ek...@vmware.com>
AuthorDate: Wed Jan 27 16:30:29 2021 -0800

    DL: Fix misc bugs
    
    JIRA: MADLIB-1464
    
    1. When validating the input shape for the validation table, we were
    passing the training table's independent variable names to the
    validate_input_shape function instead of the validation table's.
    
    2. Add a "not supported" error message for multiple dependent and
    independent variables in fit_multiple.
    
    3. PredictBYOM: Uncomment the code and test for validating
    class_values (validate_class_values).
    
    4. Add an error message for the case when fit and fit_multiple are called
    with an old version of preprocessed data.
    
    Co-authored-by: Ekta Khanna <ek...@vmware.com>
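
Item 4 above relies on a type check: the validator expects the summary table's dependent_varname and independent_varname entries to be lists and raises an error otherwise. A minimal sketch of that check follows, using made-up summary dictionaries modelled on the mocks in the unit test. The helper name is hypothetical and is not part of MADlib.

```python
def is_current_preprocessor_format(summary_dict):
    """True when both variable-name entries are lists, as the validator requires."""
    return isinstance(summary_dict['dependent_varname'], list) and \
        isinstance(summary_dict['independent_varname'], list)


# Hypothetical summary rows; the string-valued entry mimics older preprocessed data.
current_summary = {'dependent_varname': ['a'], 'independent_varname': ['b']}
older_summary = {'dependent_varname': 'a', 'independent_varname': ['b']}
```

A single non-list entry is enough to fail the check, which matches the three cases the unit test exercises (only dep, only indep, both).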
---
 .../deep_learning/madlib_keras_predict.py_in       |  4 +-
 .../deep_learning/madlib_keras_validator.py_in     | 18 +++++---
 .../test/unit_tests/test_madlib_keras.py_in        | 53 +++++++++++++++++-----
 3 files changed, 56 insertions(+), 19 deletions(-)

diff --git a/src/ports/postgres/modules/deep_learning/madlib_keras_predict.py_in b/src/ports/postgres/modules/deep_learning/madlib_keras_predict.py_in
index 0e5b1b9..d23d765 100644
--- a/src/ports/postgres/modules/deep_learning/madlib_keras_predict.py_in
+++ b/src/ports/postgres/modules/deep_learning/madlib_keras_predict.py_in
@@ -337,8 +337,8 @@ class PredictBYOM(BasePredict):
         # are traversed in order. It won't work for multi-io and prone to breaking
         # in the regular case.
 
-        # InputValidator.validate_class_values(
-        #     self.module_name, self.class_values, self.pred_type, self.model_arch)
+        InputValidator.validate_class_values(
+            self.module_name, self.class_values, self.pred_type, self.model_arch)
         InputValidator.validate_input_shape(
             self.test_table, self.independent_varname,
             get_input_shape(self.model_arch), 1)
diff --git a/src/ports/postgres/modules/deep_learning/madlib_keras_validator.py_in b/src/ports/postgres/modules/deep_learning/madlib_keras_validator.py_in
index 21eff15..439d9d9 100644
--- a/src/ports/postgres/modules/deep_learning/madlib_keras_validator.py_in
+++ b/src/ports/postgres/modules/deep_learning/madlib_keras_validator.py_in
@@ -289,6 +289,11 @@ class FitCommonValidator(object):
 
         self.dependent_varname = self.src_summary_dict['dependent_varname']
         self.independent_varname = self.src_summary_dict['independent_varname']
+        if not isinstance(self.dependent_varname, list) or \
+                not isinstance(self.independent_varname, list):
+            #TODO improve error message
+            plpy.error("Input table '{0}' has not been preprocessed properly. "
+                       "Please run input preprocessor again.".format(self.source_table))
         self.dep_shape_cols = [add_postfix(i, "_shape") for i in self.dependent_varname]
         self.ind_shape_cols = [add_postfix(i, "_shape") for i in self.independent_varname]
 
@@ -406,7 +411,7 @@ class FitCommonValidator(object):
                                input_shape, 2, True)
         if self.validation_table:
             InputValidator.validate_input_shape(
-                self.validation_table,  self.independent_varname,
+                self.validation_table,  self.val_ind_var,
                 input_shape, 2, True)
 
 
@@ -459,11 +464,12 @@ class FitMultipleInputValidator(FitCommonValidator):
                                                         use_gpus,
                                                         accessible_gpus_for_seg,
                                                         self.module_name,
-                                                        self.object_table,
-                                                        val_dep_var,
-                                                        val_ind_var)
-        self.output_model_info_table = add_postfix(output_model_table,
-                                                   '_info')
+                                                        self.object_table)
+        _assert(len(self.dependent_varname) == 1
+                or len(self.independent_varname) == 1,
+                "Multiple dependent and independent variables not supported "
+                "for madlib_keras_fit_multiple_model!")
+        self.output_model_info_table = add_postfix(output_model_table, '_info')
 
         if warm_start:
             input_tbl_valid(self.output_model_info_table, self.module_name)
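The _assert added in the hunk above passes when either list has exactly one entry, so it rejects an input only when both the dependent and the independent variables number more than one. Whether that is the intended reading of "multiple dependent and independent variables" is not stated in the commit. The sketch below simply tabulates the condition as written next to a hypothetical stricter variant that uses "and"; the function names and variable lists are illustrative.

```python
def passes_current_check(dependent_varname, independent_varname):
    """Condition exactly as written in the _assert call above."""
    return len(dependent_varname) == 1 or len(independent_varname) == 1


def passes_strict_check(dependent_varname, independent_varname):
    """Hypothetical stricter variant: both lists must have a single entry."""
    return len(dependent_varname) == 1 and len(independent_varname) == 1


cases = {
    "single io": (["y"], ["x"]),
    "multi output": (["y1", "y2"], ["x"]),
    "multi input": (["y"], ["x1", "x2"]),
    "multi io": (["y1", "y2"], ["x1", "x2"]),
}
# Map each case name to (current check result, strict check result).
results = {
    name: (passes_current_check(dep, indep), passes_strict_check(dep, indep))
    for name, (dep, indep) in cases.items()
}
```

The two variants differ only for the "multi output" and "multi input" cases, which the condition as written accepts.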
diff --git a/src/ports/postgres/modules/deep_learning/test/unit_tests/test_madlib_keras.py_in b/src/ports/postgres/modules/deep_learning/test/unit_tests/test_madlib_keras.py_in
index 928b753..5ef4517 100644
--- a/src/ports/postgres/modules/deep_learning/test/unit_tests/test_madlib_keras.py_in
+++ b/src/ports/postgres/modules/deep_learning/test/unit_tests/test_madlib_keras.py_in
@@ -907,14 +907,14 @@ class MadlibKerasPredictBYOMTestCase(unittest.TestCase):
         self.assertIn('invalid_pred_type', str(error.exception))
 
         # The validation for this test has been disabled
-        # with self.assertRaises(plpy.PLPYException) as error:
-        #     self.module.PredictBYOM('schema_madlib', 'model_arch_table',
-        #                              'model_id', 'test_table', 'id_col',
-        #                              'independent_varname', 'output_table',
-        #                              self.pred_type, self.use_gpus,
-        #                              ["foo", "bar", "baaz"], self.normalizing_const,
-        #                              self.dependent_count)
-        # self.assertIn('class values', str(error.exception).lower())
+        with self.assertRaises(plpy.PLPYException) as error:
+            self.module.PredictBYOM('schema_madlib', 'model_arch_table',
+                                     'model_id', 'test_table', 'id_col',
+                                     'independent_varname', 'output_table',
+                                     self.pred_type, self.use_gpus,
+                                     ["foo", "bar", "baaz"], self.normalizing_const,
+                                     self.dependent_count)
+        self.assertIn('class values', str(error.exception).lower())
 
         with self.assertRaises(plpy.PLPYException) as error:
             self.module.PredictBYOM('schema_madlib', 'model_arch_table',
@@ -1313,6 +1313,37 @@ class MadlibKerasFitCommonValidatorTestCase(unittest.TestCase):
             'module_name', None)
         self.assertEqual(False, obj._is_valid_metrics_compute_frequency())
 
+    def test_validator_dep_indep_type_not_array(self):
+        # only dep is not array
+        self.subject.FitCommonValidator.get_source_summary_table_dict = \
+            Mock(return_value={'dependent_varname':'a',
+                               'independent_varname':['b']})
+        with self.assertRaises(plpy.PLPYException) as error:
+            self.subject.FitCommonValidator(
+                'test_table', 'val_table', 'model_table', 5, None, False, False, [0],
+                'module_name', None)
+        self.assertIn('not been preprocessed properly', str(error.exception))
+
+        # only indep is not array
+        self.subject.FitCommonValidator.get_source_summary_table_dict = \
+            Mock(return_value={'dependent_varname':['a'],
+                               'independent_varname':'b'})
+        with self.assertRaises(plpy.PLPYException) as error:
+            self.subject.FitCommonValidator(
+                'test_table', 'val_table', 'model_table', 5, None, False, False, [0],
+                'module_name', None)
+        self.assertIn('not been preprocessed properly', str(error.exception))
+
+        # both indep and dep are not arrays
+        self.subject.FitCommonValidator.get_source_summary_table_dict = \
+            Mock(return_value={'dependent_varname':'a',
+                               'independent_varname':'b'})
+        with self.assertRaises(plpy.PLPYException) as error:
+            self.subject.FitCommonValidator(
+                'test_table', 'val_table', 'model_table', 5, None, False, False, [0],
+                'module_name', None)
+        self.assertIn('not been preprocessed properly', str(error.exception))
+
 
 class InputValidatorTestCase(unittest.TestCase):
     def setUp(self):
@@ -1391,9 +1422,9 @@ class InputValidatorTestCase(unittest.TestCase):
 
     def test_validate_input_shape_shapes_match(self):
         # minibatched data
-        # self.plpy_mock_execute.return_value = [{'shape': [1,32,32,3]}]
-        # self.subject.validate_input_shape(
-        #     self.test_table, [self.ind_var], [[32,32,3]], 2, True)
+        self.plpy_mock_execute.return_value = [{'shape': [1,32,32,3]}]
+        self.subject.validate_input_shape(
+            self.test_table, [self.ind_var], [[32,32,3]], 2, True)
         # non-minibatched data
         self.plpy_mock_execute.return_value = [{'shape': [32,32,3]}]
         self.subject.validate_input_shape(

[madlib] 01/04: DL: remove unused rotate import

Posted by nk...@apache.org.

nkak pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/madlib.git

commit b00750b221e7a1d105fd4110e5fcd0ec31d68d81
Author: Nikhil Kak <nk...@vmware.com>
AuthorDate: Tue Jan 26 11:33:31 2021 -0800

    DL: remove unused rotate import
    
    JIRA: MADLIB-1464
    
    Co-authored-by: Ekta Khanna <ek...@vmware.com>
---
 .../modules/deep_learning/madlib_keras_fit_multiple_model.py_in      | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/src/ports/postgres/modules/deep_learning/madlib_keras_fit_multiple_model.py_in b/src/ports/postgres/modules/deep_learning/madlib_keras_fit_multiple_model.py_in
index 441c155..deda8f6 100644
--- a/src/ports/postgres/modules/deep_learning/madlib_keras_fit_multiple_model.py_in
+++ b/src/ports/postgres/modules/deep_learning/madlib_keras_fit_multiple_model.py_in
@@ -42,7 +42,6 @@ from utilities.control import SetGUC
 from utilities.utilities import add_postfix
 from utilities.utilities import is_platform_gp6_or_up
 from utilities.utilities import unique_string
-from utilities.utilities import rotate
 from utilities.utilities import madlib_version
 from utilities.utilities import is_platform_pg
 from utilities.utilities import get_seg_number
@@ -257,8 +256,8 @@ class FitMultipleModel(object):
         # Ordered list of sql representations of each mst_key,
         #  including NULL's.  This will be used to pass the mst keys
         #  to the db as a sql ARRAY[]
-        self.all_mst_keys = [ str(mst['mst_key']) if mst else 'NULL'\
-                for mst in self.msts_for_schedule ]
+        self.all_mst_keys = [ str(mst['mst_key']) if mst else 'NULL' \
+                              for mst in self.msts_for_schedule ]
 
         # List of all dist_keys, including any extra dist keys beyond
         #  the # segments we'll be training on--these represent the