You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@madlib.apache.org by GitBox <gi...@apache.org> on 2020/12/18 18:39:58 UTC

[GitHub] [madlib] reductionista opened a new pull request #526: dl/automl encapsulation

reductionista opened a new pull request #526:
URL: https://github.com/apache/madlib/pull/526


   Following the principal of encapsulation, AutoML should keep track of its own temporary table names, instead of relying on assumptions about where they will be stored internally to the FitMultipleModel class.
   
   These changes allow it to co-exist with the Model Hopper refactor (but are independent from that PR), and should help make AutoML more robust to any future changes we make inside FitMultipleModel.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [madlib] reductionista merged pull request #526: DL: AutoML encapsulation

Posted by GitBox <gi...@apache.org>.
reductionista merged pull request #526:
URL: https://github.com/apache/madlib/pull/526


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [madlib] reductionista commented on a change in pull request #526: DL: AutoML encapsulation

Posted by GitBox <gi...@apache.org>.
reductionista commented on a change in pull request #526:
URL: https://github.com/apache/madlib/pull/526#discussion_r554205538



##########
File path: src/ports/postgres/modules/deep_learning/madlib_keras_automl.py_in
##########
@@ -172,45 +174,38 @@ class KerasAutoML(object):
         random_state = 'NULL' if self.random_state is None else '$MAD${0}$MAD$'.format(self.random_state)
         validation_table = 'NULL' if self.validation_table is None else '$MAD${0}$MAD$'.format(self.validation_table)
 
-        create_query = plpy.prepare("""
+        create_query = """
                 CREATE TABLE {self.model_summary_table} AS
                 SELECT
                     $MAD${self.source_table}$MAD$::TEXT AS source_table,
                     {validation_table}::TEXT AS validation_table,
                     $MAD${self.model_output_table}$MAD$::TEXT AS model,
                     $MAD${self.model_info_table}$MAD$::TEXT AS model_info,
-                    (SELECT dependent_varname FROM {model_training.model_summary_table})
-                    AS dependent_varname,
-                    (SELECT independent_varname FROM {model_training.model_summary_table})
-                    AS independent_varname,
+                    (SELECT dependent_varname FROM {a.MODEL_SUMMARY_TABLE})

Review comment:
       My intention with this PR was to try to avoid as much as possible having AutoML depend on the names and meanings of internal class variables in FitMultiple.  In the case of the model output table, there were multiple different table names inside this class, and while refactoring they changed around.  So the motivation was mainly to avoid the headache that resulted from having to go through and modify all the references to those variables in AutoML again, if they change in the future.
   
   The name of the model output table is chosen by AutoML rather than by FitMultiple, so there was a strong case for having AutoML manage that one.  For the summary table, I agree that the case is weaker... since the name is not entirely chosen by AutoML... the _summary prefix is appended to whatever AutoML chooses by FitMultiple.
   
   I considered leaving that one as-is, but decided it seems better to have AutoML add the _summary itself for a couple reasons:  first, if we do change anything in the future in FitMultiple, it's much more likely to be the class variable name that changes than the convention of adding '_summary', since that is part of the user-facing API:  user specifies what the output table name is, and they should expect _summary to be appended as described in the docs.  AutoML is using FitMultiple in a way similar to how a user would... so it will only break if we break backwards compatibility for the user... but won't break if we refactor the code inside FitMultiple.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [madlib] kaknikhil commented on a change in pull request #526: DL: AutoML encapsulation

Posted by GitBox <gi...@apache.org>.
kaknikhil commented on a change in pull request #526:
URL: https://github.com/apache/madlib/pull/526#discussion_r552291582



##########
File path: src/ports/postgres/modules/deep_learning/madlib_keras_automl.py_in
##########
@@ -79,6 +82,7 @@ class KerasAutoML(object):
         self.model_id_list = sorted(list(set(model_id_list)))
         self.compile_params_grid = compile_params_grid
         self.fit_params_grid = fit_params_grid
+        self.dist_key_col = DISTRIBUTION_KEY_COLNAME

Review comment:
       Why do we need to add the dist_key column for the automl model_output_table ? It might also help to add a description in the commit message about why we need the dist key in the model_output_table. 

##########
File path: src/ports/postgres/modules/deep_learning/madlib_keras_automl.py_in
##########
@@ -172,45 +174,38 @@ class KerasAutoML(object):
         random_state = 'NULL' if self.random_state is None else '$MAD${0}$MAD$'.format(self.random_state)
         validation_table = 'NULL' if self.validation_table is None else '$MAD${0}$MAD$'.format(self.validation_table)
 
-        create_query = plpy.prepare("""
+        create_query = """
                 CREATE TABLE {self.model_summary_table} AS
                 SELECT
                     $MAD${self.source_table}$MAD$::TEXT AS source_table,
                     {validation_table}::TEXT AS validation_table,
                     $MAD${self.model_output_table}$MAD$::TEXT AS model,
                     $MAD${self.model_info_table}$MAD$::TEXT AS model_info,
-                    (SELECT dependent_varname FROM {model_training.model_summary_table})
-                    AS dependent_varname,
-                    (SELECT independent_varname FROM {model_training.model_summary_table})
-                    AS independent_varname,
+                    (SELECT dependent_varname FROM {a.MODEL_SUMMARY_TABLE})

Review comment:
       I am assuming we are creating the automl summary table by querying from the fit multiple summary table. If that's the case, I think it's better to get the fit multiple summary table name from the model_training object since that's the source of truth.
   What do you think ?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [madlib] reductionista commented on a change in pull request #526: DL: AutoML encapsulation

Posted by GitBox <gi...@apache.org>.
reductionista commented on a change in pull request #526:
URL: https://github.com/apache/madlib/pull/526#discussion_r554205538



##########
File path: src/ports/postgres/modules/deep_learning/madlib_keras_automl.py_in
##########
@@ -172,45 +174,38 @@ class KerasAutoML(object):
         random_state = 'NULL' if self.random_state is None else '$MAD${0}$MAD$'.format(self.random_state)
         validation_table = 'NULL' if self.validation_table is None else '$MAD${0}$MAD$'.format(self.validation_table)
 
-        create_query = plpy.prepare("""
+        create_query = """
                 CREATE TABLE {self.model_summary_table} AS
                 SELECT
                     $MAD${self.source_table}$MAD$::TEXT AS source_table,
                     {validation_table}::TEXT AS validation_table,
                     $MAD${self.model_output_table}$MAD$::TEXT AS model,
                     $MAD${self.model_info_table}$MAD$::TEXT AS model_info,
-                    (SELECT dependent_varname FROM {model_training.model_summary_table})
-                    AS dependent_varname,
-                    (SELECT independent_varname FROM {model_training.model_summary_table})
-                    AS independent_varname,
+                    (SELECT dependent_varname FROM {a.MODEL_SUMMARY_TABLE})

Review comment:
       My intention with this PR was to try to avoid as much as possible having AutoML depend on the names and meanings of internal class variables in FitMultiple.  In the case of the model output table, there were multiple different table names inside this case, and while refactoring they changed around.  So the motivation was mainly to avoid the headache that resulted from having to go through and modify all the references to those variables in AutoML again, if they change in the future.
   
   The name of the model output table is chosen by AutoML rather than by FitMultiple, so the was a strong case for having AutoML manage that one.  For the summary table, I agree that the case is weaker... since the name is not entirely chosen by AutoML... the _summary prefix is appended to whatever AutoML chooses by FitMultiple.
   
   I considered leaving that one as-is, but decided it seems better to have AutoML add the _summary itself for a couple reasons:  first, if we do change anything in the future in FitMultiple, it's much more likely to be the class variable name that changes than the convention of adding '_summary', since that is part of the user-facing API:  user specifies what the output table name is, and they should expect _summary to be appended as described in the docs.  AutoML is using FitMultiple in a way similar to how a user would... so it will only break if we break backwards compatibility for the user... but won't break if we refactor the code inside FitMultiple.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [madlib] reductionista commented on a change in pull request #526: DL: AutoML encapsulation

Posted by GitBox <gi...@apache.org>.
reductionista commented on a change in pull request #526:
URL: https://github.com/apache/madlib/pull/526#discussion_r554188957



##########
File path: src/ports/postgres/modules/deep_learning/madlib_keras_automl.py_in
##########
@@ -79,6 +82,7 @@ class KerasAutoML(object):
         self.model_id_list = sorted(list(set(model_id_list)))
         self.compile_params_grid = compile_params_grid
         self.fit_params_grid = fit_params_grid
+        self.dist_key_col = DISTRIBUTION_KEY_COLNAME

Review comment:
       Currently it's not required, but while working on the Model Hopper refactor I realized it would help a lot with warm start if we eventually do require it.
   
   As I was working on optimizing weight initialization, I realized if we could rely on model output tables always having a dist key, then that would speed things up and avoid unnecessary work.  Otherwise the first step has to be copying the table over to one which does have the dist key, which usually involves shuffling the weights around to different segments.  If there is no dist key, then we can't assume anything about how the weights are distributed so there is no way to optimize that part.   For all we (or gpdb) knows, all of the weights might be on the same segment with none on any other segments.
   
   All newly generated output tables will have the dist key in them (I should make that change to fit also, come to think of it), but because they won't for v1.17 I don't require it as an input for warm start yet... we still do the extra unnecessary shuffling each time for backwards compatibility.
   
   So there's nothing necessary about it right now, but the earlier we get this into the codebase the earlier we can drop compatibility for warm start on output tables missing a dist key.

##########
File path: src/ports/postgres/modules/deep_learning/madlib_keras_automl.py_in
##########
@@ -172,45 +174,38 @@ class KerasAutoML(object):
         random_state = 'NULL' if self.random_state is None else '$MAD${0}$MAD$'.format(self.random_state)
         validation_table = 'NULL' if self.validation_table is None else '$MAD${0}$MAD$'.format(self.validation_table)
 
-        create_query = plpy.prepare("""
+        create_query = """
                 CREATE TABLE {self.model_summary_table} AS
                 SELECT
                     $MAD${self.source_table}$MAD$::TEXT AS source_table,
                     {validation_table}::TEXT AS validation_table,
                     $MAD${self.model_output_table}$MAD$::TEXT AS model,
                     $MAD${self.model_info_table}$MAD$::TEXT AS model_info,
-                    (SELECT dependent_varname FROM {model_training.model_summary_table})
-                    AS dependent_varname,
-                    (SELECT independent_varname FROM {model_training.model_summary_table})
-                    AS independent_varname,
+                    (SELECT dependent_varname FROM {a.MODEL_SUMMARY_TABLE})

Review comment:
       My intention with this PR was to try to avoid as much as possible having AutoML depend on the names and meanings of internal class variables in FitMultiple.  In the case of the model output table, there were multiple different table names inside this case, and while refactoring they changed around.  So the motivation was mainly to avoid the headache that resulted from having to go through and modify all the references to those variables in AutoML again, if they change in the future.
   
   The name of the model output table is chosen by AutoML rather than by FitMultiple, so the was a strong case for having AutoML manage that one.  For the summary table, I agree that the case is weaker... since the name is not entirely chosen by AutoML... the _summary prefix is appended to whatever AutoML chooses by FitMultiple.
   
   I considered leaving that one as-is, but decided it seems better to have AutoML add the _summary itself for a couple reasons:  first, if we do change anything in the future in FitMultiple, it's much more likely to be the class variable name that changes than the convention of adding '_summary', since that is part of the user-facing API:  user specifies what the output table name is, and they should expect _summary to be appended as described in the docs.  AutoML is using FitMultiple in a way similar to how a user would... so it will only break if we break backwards compatibility for the user... but won't break if we refactor the code inside FitMultiple.

##########
File path: src/ports/postgres/modules/deep_learning/madlib_keras_automl.py_in
##########
@@ -172,45 +174,38 @@ class KerasAutoML(object):
         random_state = 'NULL' if self.random_state is None else '$MAD${0}$MAD$'.format(self.random_state)
         validation_table = 'NULL' if self.validation_table is None else '$MAD${0}$MAD$'.format(self.validation_table)
 
-        create_query = plpy.prepare("""
+        create_query = """
                 CREATE TABLE {self.model_summary_table} AS
                 SELECT
                     $MAD${self.source_table}$MAD$::TEXT AS source_table,
                     {validation_table}::TEXT AS validation_table,
                     $MAD${self.model_output_table}$MAD$::TEXT AS model,
                     $MAD${self.model_info_table}$MAD$::TEXT AS model_info,
-                    (SELECT dependent_varname FROM {model_training.model_summary_table})
-                    AS dependent_varname,
-                    (SELECT independent_varname FROM {model_training.model_summary_table})
-                    AS independent_varname,
+                    (SELECT dependent_varname FROM {a.MODEL_SUMMARY_TABLE})

Review comment:
       My intention with this PR was to try to avoid as much as possible having AutoML depend on the names and meanings of internal class variables in FitMultiple.  In the case of the model output table, there were multiple different table names inside this class, and while refactoring they changed around.  So the motivation was mainly to avoid the headache that resulted from having to go through and modify all the references to those variables in AutoML again, if they change in the future.
   
   The name of the model output table is chosen by AutoML rather than by FitMultiple, so there was a strong case for having AutoML manage that one.  For the summary table, I agree that the case is weaker... since the name is not entirely chosen by AutoML... the _summary prefix is appended to whatever AutoML chooses by FitMultiple.
   
   I considered leaving that one as-is, but decided it seems better to have AutoML add the _summary itself for a couple reasons:  first, if we do change anything in the future in FitMultiple, it's much more likely to be the class variable name that changes than the convention of adding '_summary', since that is part of the user-facing API:  user specifies what the output table name is, and they should expect _summary to be appended as described in the docs.  AutoML is using FitMultiple in a way similar to how a user would... so it will only break if we break backwards compatibility for the user... but won't break if we refactor the code inside FitMultiple.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [madlib] reductionista commented on a change in pull request #526: DL: AutoML encapsulation

Posted by GitBox <gi...@apache.org>.
reductionista commented on a change in pull request #526:
URL: https://github.com/apache/madlib/pull/526#discussion_r554188957



##########
File path: src/ports/postgres/modules/deep_learning/madlib_keras_automl.py_in
##########
@@ -79,6 +82,7 @@ class KerasAutoML(object):
         self.model_id_list = sorted(list(set(model_id_list)))
         self.compile_params_grid = compile_params_grid
         self.fit_params_grid = fit_params_grid
+        self.dist_key_col = DISTRIBUTION_KEY_COLNAME

Review comment:
       Currently it's not required, but while working on the Model Hopper refactor I realized it would help a lot with warm start if we eventually do require it.
   
   As I was working on optimizing weight initialization, I realized if we could rely on model output tables always having a dist key, then that would speed things up and avoid unnecessary work.  Otherwise the first step has to be copying the table over to one which does have the dist key, which usually involves shuffling the weights around to different segments.  If there is no dist key, then we can't assume anything about how the weights are distributed so there is no way to optimize that part.   For all we (or gpdb) knows, all of the weights might be on the same segment with none on any other segments.
   
   All newly generated output tables will have the dist key in them (I should make that change to fit also, come to think of it), but because they won't for v1.17 I don't require it as an input for warm start yet... we still do the extra unnecessary shuffling each time for backwards compatibility.
   
   So there's nothing necessary about it right now, but the earlier we get this into the codebase the earlier we can drop compatibility for warm start on output tables missing a dist key.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [madlib] khannaekta commented on a change in pull request #526: DL: AutoML encapsulation

Posted by GitBox <gi...@apache.org>.
khannaekta commented on a change in pull request #526:
URL: https://github.com/apache/madlib/pull/526#discussion_r555388554



##########
File path: src/ports/postgres/modules/deep_learning/madlib_keras_automl.py_in
##########
@@ -172,45 +183,38 @@ class KerasAutoML(object):
         random_state = 'NULL' if self.random_state is None else '$MAD${0}$MAD$'.format(self.random_state)
         validation_table = 'NULL' if self.validation_table is None else '$MAD${0}$MAD$'.format(self.validation_table)
 
-        create_query = plpy.prepare("""
+        create_query = """
                 CREATE TABLE {self.model_summary_table} AS
                 SELECT
                     $MAD${self.source_table}$MAD$::TEXT AS source_table,
                     {validation_table}::TEXT AS validation_table,
                     $MAD${self.model_output_table}$MAD$::TEXT AS model,
                     $MAD${self.model_info_table}$MAD$::TEXT AS model_info,
-                    (SELECT dependent_varname FROM {model_training.model_summary_table})
-                    AS dependent_varname,
-                    (SELECT independent_varname FROM {model_training.model_summary_table})
-                    AS independent_varname,
+                    (SELECT dependent_varname FROM {a.MODEL_SUMMARY_TABLE})
+                        AS dependent_varname,
+                    (SELECT independent_varname FROM {a.MODEL_SUMMARY_TABLE})
+                        AS independent_varname,
                     $MAD${self.model_arch_table}$MAD$::TEXT AS model_arch_table,
                     $MAD${self.model_selection_table}$MAD$::TEXT AS model_selection_table,
                     $MAD${self.automl_method}$MAD$::TEXT AS automl_method,
                     $MAD${self.automl_params}$MAD$::TEXT AS automl_params,
                     {random_state}::TEXT AS random_state,
                     {object_table}::TEXT AS object_table,
                     {self.use_gpus} AS use_gpus,
-                    (SELECT metrics_compute_frequency FROM {model_training.model_summary_table})::INTEGER
-                    AS metrics_compute_frequency,
-                    {name}::TEXT AS name,
-                    {descr}::TEXT AS description,
+                    (SELECT metrics_compute_frequency FROM {a.MODEL_SUMMARY_TABLE})::INTEGER 
+                        AS metrics_compute_frequency,
+                    $MAD${self.name}$MAD$::TEXT AS name,
+                    $MAD${self.description}$MAD$::TEXT AS description,

Review comment:
       ```suggestion
                       {name}::TEXT AS name,
                       {descr}::TEXT AS description,
   ```
   As we have the code to check for NULL and add `$MAD$` in L180-181. 




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org