Posted to dev@madlib.apache.org by GitBox <gi...@apache.org> on 2020/07/20 21:13:34 UTC

[GitHub] [madlib] Advitya17 opened a new pull request #506: DL: Add grid/random search for model selection with `generate_model_selection_configs`

Advitya17 opened a new pull request #506:
URL: https://github.com/apache/madlib/pull/506


   JIRA: MADLIB-1439
   
   The `load_model_selection_table` function requires the user to manually specify the grid of compile and fit params. Hence, we implement a function called `generate_model_selection_configs` (in the same module) to perform grid/random search.
   
   The user declares the compile and fit param grids separately, as strings wrapping Python dictionaries, together with the name of the search algorithm (and any corresponding arguments). The output format of the new function matches that of `load_model_selection_table`, for better interoperability with other MADlib functions related to model training.
   
   This pull request includes the implementation, unit tests (in Python and SQL), and documentation for the new function.
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org




[GitHub] [madlib] khannaekta commented on a change in pull request #506: DL: Add grid/random search for model selection with `generate_model_selection_configs`

Posted by GitBox <gi...@apache.org>.
khannaekta commented on a change in pull request #506:
URL: https://github.com/apache/madlib/pull/506#discussion_r465932779



##########
File path: src/ports/postgres/modules/deep_learning/madlib_keras_wrapper.py_in
##########
@@ -201,13 +201,14 @@ def parse_and_validate_compile_params(str_of_args):
     """
     literal_eval_compile_params = ['metrics', 'loss_weights',
                                    'weighted_metrics', 'sample_weight_mode']
-    accepted_compile_params = literal_eval_compile_params + ['optimizer', 'loss']
+    accepted_compile_params = literal_eval_compile_params + ['optimizer', 'loss', 'optimizer_params_list']
 
     compile_dict = convert_string_of_args_to_dict(str_of_args)
     compile_dict = validate_and_literal_eval_keys(compile_dict,
                                                   literal_eval_compile_params,
                                                   accepted_compile_params)
-    _assert('optimizer' in compile_dict, "optimizer is a required parameter for compile")
+    # _assert('optimizer' in compile_dict, "optimizer is a required parameter for compile")

Review comment:
      Remove this commented-out code.

##########
File path: src/ports/postgres/modules/deep_learning/madlib_keras_model_selection.py_in
##########
@@ -442,28 +468,43 @@ class MstSearch():
             if self.random_state:
                 np.random.seed(self.random_state+seed_changes)
                 seed_changes += 1
-
             param_values = params_dict[cp]
-
-            # sampling from a distribution
-            if param_values[-1] in ['linear', 'log']:
-                _assert(len(param_values) == 3,
-                        "DL: {0} should have exactly 3 elements if picking from a distribution".format(cp))
-                _assert(param_values[1] > param_values[0],
-                        "DL: {0} should be of the format [lower_bound, uppper_bound, distribution_type]".format(cp))
-                if param_values[-1] == 'linear':
-                    config_dict[cp] = np.random.uniform(param_values[0], param_values[1])
-                elif param_values[-1] == 'log':
-                    config_dict[cp] = np.power(10, np.random.uniform(np.log10(param_values[0]),
-                                                                     np.log10(param_values[1])))
-                else:
-                    plpy.error("DL: Choose a valid distribution type! ('linear' or 'log')")
-            # random sampling
+            if cp == ModelSelectionSchema.OPTIMIZER_PARAMS_LIST:
+                opt_dict = np.random.choice(param_values)
+                opt_combination = {}
+                for i in opt_dict:
+                    opt_values = opt_dict[i]
+                    if self.random_state:
+                        np.random.seed(self.random_state+seed_changes)
+                        seed_changes += 1
+                    opt_combination[i] = self.sample_val(cp, opt_values)
+                config_dict[cp] = opt_combination
             else:
-                config_dict[cp] = np.random.choice(params_dict[cp])
-
+                config_dict[cp] = self.sample_val(cp, param_values)
         return config_dict, seed_changes
 
+    def sample_val(self, cp, param_value_list):

Review comment:
      It would be good to add a comment describing what this function does.
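   For reference, here is a rough Python sketch of what `sample_val` presumably does, reconstructed from the inlined sampling code this hunk removes (the real implementation lives in `madlib_keras_model_selection.py_in` and uses `_assert`/`plpy` rather than plain `assert`):

```python
import numpy as np

def sample_val(cp, param_value_list):
    """Sample one value for compile/fit param `cp`.

    If the list ends in 'linear' or 'log', it is treated as
    [lower_bound, upper_bound, distribution_type] and a value is drawn
    from that distribution; otherwise one element is chosen uniformly
    at random.
    """
    if param_value_list[-1] in ('linear', 'log'):
        assert len(param_value_list) == 3, \
            "{0} should have exactly 3 elements if picking from a distribution".format(cp)
        lo, hi = param_value_list[0], param_value_list[1]
        assert hi > lo, \
            "{0} should be of the format [lower_bound, upper_bound, distribution_type]".format(cp)
        if param_value_list[-1] == 'linear':
            return np.random.uniform(lo, hi)
        # 'log': uniform in log10 space
        return np.power(10, np.random.uniform(np.log10(lo), np.log10(hi)))
    return np.random.choice(param_value_list)
```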

##########
File path: src/ports/postgres/modules/deep_learning/test/unit_tests/test_madlib_keras_model_selection_table.py_in
##########
@@ -77,6 +373,8 @@ class LoadModelSelectionTableTestCase(unittest.TestCase):
             "batch_size=10,epochs=1"
         ]
 
+        # plpy.execute("SELECT * FROM invalid_table;")

Review comment:
      We can remove this commented-out code.







[GitHub] [madlib] fmcquillan99 edited a comment on pull request #506: DL: Add grid/random search for model selection with `generate_model_selection_configs`

Posted by GitBox <gi...@apache.org>.
fmcquillan99 edited a comment on pull request #506:
URL: https://github.com/apache/madlib/pull/506#issuecomment-665993038


   errors and issues
   
   (1)
   ```
   SELECT madlib.generate_model_selection_configs(
                                           'model_arch_library', -- model architecture table
                                           'mst_table',          -- model selection table output
                                            ARRAY[1,2],              -- model ids from model architecture table
                                            $$
                                            { 
                                             'lr': [1.0, 2.0, 'linear']
                                            } 
                                            $$, -- compile_param_grid 
                                            $$ 
                                            { 'batch_size': [8],
                                              'epochs': [1] 
                                            } 
                                            $$, -- fit_param_grid 
                                            
                                            'random', -- search_type (‘grid’ or ‘random’, default ‘grid’) 
                                            5, -- num_configs (number of sampled parameters. Default=10) [to limit testing] 
                                            NULL, -- random_state 
                                            NULL -- object table (Default=None)  
                                            );
   ```
   produces
   ```
   InternalError: (psycopg2.errors.InternalError_) TypeError: cannot concatenate 'str' and 'float' objects (plpython.c:5038)
   CONTEXT:  Traceback (most recent call last):
     PL/Python function "generate_model_selection_configs", line 21, in <module>
       mst_loader = madlib_keras_model_selection.MstSearch(**globals())
     PL/Python function "generate_model_selection_configs", line 42, in wrapper
     PL/Python function "generate_model_selection_configs", line 287, in __init__
     PL/Python function "generate_model_selection_configs", line 426, in find_random_combinations
     PL/Python function "generate_model_selection_configs", line 490, in generate_row_string
   PL/Python function "generate_model_selection_configs"
   
   [SQL: SELECT madlib.generate_model_selection_configs(
                                           'model_arch_library', -- model architecture table
                                           'mst_table',          -- model selection table output
                                            ARRAY[1,2],              -- model ids from model architecture table
                                            $$
                                            { 'loss': ['categorical_crossentropy'],
                                             'lr': [0.0001, 0.1, 'linear']
                                            } 
                                            $$, -- compile_param_grid 
                                            $$ 
                                            { 'batch_size': [8],
                                              'epochs': [1] 
                                            } 
                                            $$, -- fit_param_grid 
                                            
                                            'random', -- search_type (‘grid’ or ‘random’, default ‘grid’) 
                                            5, -- num_configs (number of sampled parameters. Default=10) [to limit testing] 
                                            NULL, -- random_state 
                                            NULL -- object table (Default=None)  
                                            );]
   (Background on this error at: http://sqlalche.me/e/2j85)
   ```
   
   Likewise
   ```
   DROP TABLE IF EXISTS mst_table, mst_table_summary;
   
   SELECT madlib.generate_model_selection_configs(
                                           'model_arch_library', -- model architecture table
                                           'mst_table',          -- model selection table output
                                            ARRAY[1,2],              -- model ids from model architecture table
                                            $$
                                            { 
                                             'lr': [1.0, 2.0, 'log'],
                                            } 
                                            $$, -- compile_param_grid 
                                            $$ 
                                            { 'batch_size': [8],
                                              'epochs': [1] 
                                            } 
                                            $$, -- fit_param_grid 
                                            
                                            'random', -- search_type (‘grid’ or ‘random’, default ‘grid’) 
                                            1, -- num_configs (number of sampled parameters. Default=10) [to limit testing] 
                                            NULL, -- random_state 
                                            NULL -- object table (Default=None)  
                                            ); 
                                            
   SELECT * FROM mst_table ORDER BY mst_key;
   ```
   produces
   ```
   InternalError: (psycopg2.errors.InternalError_) TypeError: cannot concatenate 'str' and 'numpy.float64' objects (plpython.c:5038)
   CONTEXT:  Traceback (most recent call last):
     PL/Python function "generate_model_selection_configs", line 21, in <module>
       mst_loader = madlib_keras_model_selection.MstSearch(**globals())
     PL/Python function "generate_model_selection_configs", line 42, in wrapper
     PL/Python function "generate_model_selection_configs", line 287, in __init__
     PL/Python function "generate_model_selection_configs", line 426, in find_random_combinations
     PL/Python function "generate_model_selection_configs", line 490, in generate_row_string
   PL/Python function "generate_model_selection_configs"
   
   [SQL: SELECT madlib.generate_model_selection_configs(
                                           'model_arch_library', -- model architecture table
                                           'mst_table',          -- model selection table output
                                            ARRAY[1,2],              -- model ids from model architecture table
                                            $$
                                            { 
                                             'lr': [1.0, 2.0, 'log'],
                                            } 
                                            $$, -- compile_param_grid 
                                            $$ 
                                            { 'batch_size': [8],
                                              'epochs': [1] 
                                            } 
                                            $$, -- fit_param_grid 
                                            
                                            'random', -- search_type (‘grid’ or ‘random’, default ‘grid’) 
                                            1, -- num_configs (number of sampled parameters. Default=10) [to limit testing] 
                                            NULL, -- random_state 
                                            NULL -- object table (Default=None)  
                                            );]
   (Background on this error at: http://sqlalche.me/e/2j85)
   ```
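   Both tracebacks end in `generate_row_string`, and PL/Python here runs Python 2, where concatenating a str with a float (or numpy.float64) raises exactly this TypeError. Below is a minimal reproduction of the failure mode, with the obvious fix; the function name comes from the traceback, and the body is only a guess at what it does:

```python
# A float sampled from a ['linear'/'log'] distribution ends up being
# concatenated into the compile-params string without a cast.
lr = 0.0123

try:
    s = "lr=" + lr        # TypeError: cannot concatenate 'str' and 'float'
except TypeError:
    s = "lr=" + str(lr)   # casting the sampled value fixes it
```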
   
   (2)
   For search_type = 'grid' or 'random', the user should be able to enter a prefix of the string, e.g., 'rand' for random or 'g' for grid.  There is a MADlib function that supports this.
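   The MADlib helper in question is not named here; as a sketch under that assumption, prefix matching could look like:

```python
def expand_search_type(user_input, valid=('grid', 'random')):
    # Accept any unambiguous prefix of a valid search type,
    # e.g. 'rand' -> 'random', 'g' -> 'grid'.
    s = user_input.strip().lower()
    matches = [v for v in valid if s and v.startswith(s)]
    if len(matches) != 1:
        raise ValueError("DL: 'search_type' must be either 'grid' or 'random'")
    return matches[0]
```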
   
   
   (3)
   change the name of the function from `generate_model_selection_configs`
   to `generate_model_configs`
   
   
   (4)
   Remove exclamation marks and inconsistent capitalization from error messages. Suggested messages:
   
   "DL: 'num_configs' and 'random_state' must be NULL for grid search"
   
   "DL: Cannot search from a distribution with grid search"
   
   "DL: 'num_configs' cannot be NULL for random search"
   
   "DL: 'search_type' must be either 'grid' or 'random'"
   
   "DL: Please choose a valid distribution type ('linear' or 'log')"
   
   "DL: {0} should be of the format [lower_bound, upper_bound, distribution_type]"
   
   
   (5)
   In addition to `linear` sampling and `log` sampling we should add another type
   called `log_near_one`
   ```
    config_dict[cp] = 1.0 - np.power(10, np.random.uniform(np.log10(1.0 - param_values[1]), np.log10(1.0 - param_values[0])))
   ```
   This type of sampling is useful for exponentially-weighted-average params like momentum, which are very sensitive to changes near 1. It has the effect of producing more values near 1 than regular log sampling.
   
   For example, momentum values in the range [0.9000, 0.9005] average roughly the previous 10 values no matter where you are in the range (no practical difference), but momentum values in the range [0.9990, 0.9995] average roughly the previous 1000 values at the left end and the previous 2000 values at the right end (a big difference), so you want to generate more samples nearer the right side to get better coverage.
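   A runnable sketch of the suggested sampler (the names follow the `log` / `log_near_one` suggestion above; this is an illustration, not the PR's code):

```python
import numpy as np

def sample_log(lo, hi):
    # regular log-uniform sampling in [lo, hi]
    return np.power(10, np.random.uniform(np.log10(lo), np.log10(hi)))

def sample_log_near_one(lo, hi):
    # log-uniform in the distance from 1, so samples cluster near the
    # upper end of [lo, hi] -- appropriate for momentum-like params
    return 1.0 - np.power(10, np.random.uniform(np.log10(1.0 - hi),
                                                np.log10(1.0 - lo)))

# roughly half the draws land in [0.99, 0.999], the top decade of
# distance-to-1, versus far fewer for plain log sampling
vals = [sample_log_near_one(0.9, 0.999) for _ in range(1000)]
```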
   
   
   (6)
   ```
   DROP TABLE IF EXISTS mst_table, mst_table_summary;
   
   SELECT madlib.generate_model_selection_configs(
                                           'model_arch_library', -- model architecture table
                                           'mst_table',          -- model selection table output
                                            ARRAY[1],              -- model ids from model architecture table
                                            $$
                                            { 'loss': ['categorical_crossentropy'],
                                             'optimizer': ['Adam'],
                                             'lr': [0.9, 0.95, 'log'],
                                             'metrics': ['accuracy']
                                            } 
                                            $$, -- compile_param_grid 
                                            $$ 
                                            { 'batch_size': [8, 32, 64, 128, 256, 1024, 4096],
                                              'epochs': [1, 2, 3, 5, 10, 12] 
                                            } 
                                            $$, -- fit_param_grid 
                                            
                                            'random', -- search_type
                                            5, -- num_configs
                                            NULL, -- random_state 
                                            NULL -- object table (Default=None)  
                                            ); 
                                            
   SELECT * FROM mst_table ORDER BY mst_key;
   ```
   followed by 
   ```
   SELECT madlib.generate_model_selection_configs(
                                           'model_arch_library', -- model architecture table
                                           'mst_table',          -- model selection table output
                                            ARRAY[1],              -- model ids from model architecture table
                                            $$
                                            { 'loss': ['categorical_crossentropy'],
                                             'optimizer': ['SGD'],
                                             'metrics': ['accuracy']
                                            } 
                                            $$, -- compile_param_grid 
                                            $$ 
                                            { 'batch_size': [8, 32, 64, 128, 256, 1024, 4096],
                                              'epochs': [1, 2, 3, 5, 10, 12] 
                                            } 
                                            $$, -- fit_param_grid 
                                            
                                            'random', -- search_type
                                            5, -- num_configs
                                            NULL, -- random_state 
                                            NULL -- object table (Default=None)  
                                            ); 
                                            
   SELECT * FROM mst_table ORDER BY mst_key;
   ```
   produces
   ```
   IntegrityError: (psycopg2.errors.UniqueViolation) plpy.SPIError: duplicate key value violates unique constraint "mst_table_model_id_key"  (seg0 10.128.0.41:40000 pid=22297)
   DETAIL:  Key (model_id, compile_params, fit_params)=(1, optimizer='SGD()',metrics=['accuracy'],loss='categorical_crossentropy', epochs=12,batch_size=32) already exists.
   CONTEXT:  Traceback (most recent call last):
     PL/Python function "generate_model_selection_configs", line 22, in <module>
       mst_loader.load()
     PL/Python function "generate_model_selection_configs", line 313, in load
     PL/Python function "generate_model_selection_configs", line 566, in insert_into_mst_table
   PL/Python function "generate_model_selection_configs"
   
   [SQL: SELECT madlib.generate_model_selection_configs( 'model_arch_library', -- model architecture table
                                           'mst_table',          -- model selection table output
                                            ARRAY[1],              -- model ids from model architecture table
                                            $$
                                            { 'loss': ['categorical_crossentropy'],
                                             'optimizer': ['SGD'],
                                             'metrics': ['accuracy']
                                            } 
                                            $$, -- compile_param_grid 
                                            $$ 
                                            { 'batch_size': [8, 32, 64, 128, 256, 1024, 4096],
                                              'epochs': [1, 2, 3, 5, 10, 12] 
                                            } 
                                            $$, -- fit_param_grid 
                                            
                                            'random', -- search_type
                                            5, -- num_configs
                                            NULL, -- random_state 
                                            NULL -- object table (Default=None)  
                                            );]
   (Background on this error at: http://sqlalche.me/e/gkpj)
   ```
   But it only produced the error on every second run: the first pass would work, then the second pass would throw the error.
   
   When it does pass, it produces
   ```
    mst_key | model_id |                                        compile_params                                        |        fit_params        
   ---------+----------+----------------------------------------------------------------------------------------------+--------------------------
          1 |        1 | optimizer='Adam(lr=0.9063214445649174)',metrics=['accuracy'],loss='categorical_crossentropy' | epochs=10,batch_size=256
          2 |        1 | optimizer='Adam(lr=0.9367722192055232)',metrics=['accuracy'],loss='categorical_crossentropy' | epochs=5,batch_size=256
          3 |        1 | optimizer='Adam(lr=0.9212048311857509)',metrics=['accuracy'],loss='categorical_crossentropy' | epochs=2,batch_size=32
          4 |        1 | optimizer='Adam(lr=0.9193149125403647)',metrics=['accuracy'],loss='categorical_crossentropy' | epochs=3,batch_size=256
          5 |        1 | optimizer='Adam(lr=0.9326284661833211)',metrics=['accuracy'],loss='categorical_crossentropy' | epochs=2,batch_size=256
          6 |        1 | optimizer='SGD()',metrics=['accuracy'],loss='categorical_crossentropy'                       | epochs=10,batch_size=256
          7 |        1 | optimizer='SGD()',metrics=['accuracy'],loss='categorical_crossentropy'                       | epochs=5,batch_size=8
          8 |        1 | optimizer='SGD()',metrics=['accuracy'],loss='categorical_crossentropy'                       | epochs=2,batch_size=1024
          9 |        1 | optimizer='SGD()',metrics=['accuracy'],loss='categorical_crossentropy'                       | epochs=3,batch_size=32
         10 |        1 | optimizer='SGD()',metrics=['accuracy'],loss='categorical_crossentropy'                       | epochs=12,batch_size=8
   (10 rows)
   ```
   is `optimizer='SGD()'...` correct or should it be `optimizer='SGD'...` ?
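   One way to avoid the unique-constraint violation above is to drop sampled configs that already exist before inserting. A hedged sketch (the real insert happens in `insert_into_mst_table`, per the traceback; the helper name below is made up):

```python
def dedup_configs(new_configs, existing_configs):
    # Filter out (model_id, compile_params, fit_params) tuples that are
    # already in mst_table, and duplicates within the new batch itself.
    seen = set(existing_configs)
    unique = []
    for cfg in new_configs:
        if cfg not in seen:
            seen.add(cfg)
            unique.append(cfg)
    return unique
```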
   
   
   (7)
   Not all sub-params apply to all params.  For example, for optimizer, `lr` and `decay` might only apply to certain optimizer types and not others:
   ```
   optimizer='SGD'
   optimizer='rmsprop(lr=0.0001, decay=1e-6)'
   optimizer='adam(lr=0.0001)'
   ```
   In the previous method we accounted for that by doing:
   ```
   SELECT madlib.load_model_selection_table('model_arch_library', -- model architecture table
                                            'mst_table',          -- model selection table output
                                             ARRAY[1,2],          -- model ids from model architecture table
                                             ARRAY[               -- compile params   
                                                 $$loss='categorical_crossentropy',optimizer='rmsprop(lr=0.0001, decay=1e-6)',metrics=['accuracy']$$,
                                                 $$loss='categorical_crossentropy',optimizer='rmsprop(lr=0.001, decay=1e-6)',metrics=['accuracy']$$,
                                                 $$loss='categorical_crossentropy',optimizer='adam(lr=0.0001)',metrics=['accuracy']$$,
                                                 $$loss='categorical_crossentropy',optimizer='adam(lr=0.001)',metrics=['accuracy']$$
                                             ],
                                             ARRAY[                -- fit params
                                                 $$batch_size=64,epochs=5$$, 
                                                 $$batch_size=128,epochs=5$$
                                             ]
                                            );
   ```
   but how do we do this in the new method `generate_model_configs`? You could call it multiple times and incrementally build up the `mst_table`, but when autoML methods call this function we need to support it in a single shot. I would suggest nested dictionaries like:
   ```
   SELECT madlib.generate_model_configs(
                                           'model_arch_library', -- model architecture table
                                           'mst_table',          -- model selection table output
                                            ARRAY[1],              -- model ids from model architecture table
                                            $$
                                            { 'loss': ['categorical_crossentropy'],
                                               'my_list': [
                                                   {'optimizer': ['SGD', 'Adagrad']},
                                                   {'optimizer': ['rmsprop'], 'lr': [0.9, 0.95, 'log'], 'decay': [1e-6, 1e-4, 'log']},
                                                   {'optimizer': ['Adam'], 'lr': [0.99, 0.995, 'log']}
                                               ],
                                              'metrics': ['accuracy']
                                            } 
                                            $$, -- compile_param_grid 
                                            $$ 
                                            { 'batch_size': [8, 32, 64, 128, 256, 1024, 4096],
                                              'epochs': [1, 2, 3, 5, 10, 12] 
                                            } 
                                            $$, -- fit_param_grid 
                                            
                                            'random', -- search_type
                                            5, -- num_configs
                                            NULL, -- random_state 
                                            NULL -- object table (Default=None)  
                                             );
   ```
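   To make the nested-dictionary idea concrete, here is a minimal Python sketch of how random search could sample from such a grid (the `my_list` key and the sampler are illustrative assumptions, not the PR's implementation; only the 'log' distribution is handled):

```python
import numpy as np

def sample_one(vals):
    # [lower, upper, 'log'] -> log-uniform draw; otherwise uniform choice
    if vals[-1] == 'log':
        lo, hi = vals[0], vals[1]
        return float(np.power(10, np.random.uniform(np.log10(lo), np.log10(hi))))
    return vals[np.random.randint(len(vals))]

def sample_compile_config(grid):
    # For 'my_list', first pick one optimizer group, then sample each
    # sub-param within that group; other keys are sampled directly.
    config = {}
    for key, values in grid.items():
        if key == 'my_list':
            group = values[np.random.randint(len(values))]
            for sub_key, sub_vals in group.items():
                config[sub_key] = sample_one(sub_vals)
        else:
            config[key] = sample_one(values)
    return config

compile_grid = {
    'loss': ['categorical_crossentropy'],
    'my_list': [
        {'optimizer': ['SGD', 'Adagrad']},
        {'optimizer': ['rmsprop'], 'lr': [0.9, 0.95, 'log'], 'decay': [1e-6, 1e-4, 'log']},
        {'optimizer': ['Adam'], 'lr': [0.99, 0.995, 'log']},
    ],
    'metrics': ['accuracy'],
}
cfg = sample_compile_config(compile_grid)
```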
   
   





[GitHub] [madlib] Advitya17 edited a comment on pull request #506: DL: Add grid/random search for model selection with `generate_model_selection_configs`

Posted by GitBox <gi...@apache.org>.
Advitya17 edited a comment on pull request #506:
URL: https://github.com/apache/madlib/pull/506#issuecomment-669530020


   @khannaekta I have made the changes specified above.





[GitHub] [madlib] Advitya17 edited a comment on pull request #506: DL: Add grid/random search for model selection with `generate_model_selection_configs`

Posted by GitBox <gi...@apache.org>.
Advitya17 edited a comment on pull request #506:
URL: https://github.com/apache/madlib/pull/506#issuecomment-668955356


   ![image](https://user-images.githubusercontent.com/37190647/89367810-d70ee880-d68e-11ea-8284-cfd1b30ece2e.png)
   
   According to the Keras documentation, there is a default value for the optimizer as well; it is rmsprop rather than the SGD I chose manually.
   
   From my reading of the documentation, yes, an optimizer is required to compile a Keras model, but Keras already falls back to a default optimizer (with default params) when the user does not specify one.
   
   Please correct me if I missed anything or misunderstood.
   
   ![image](https://user-images.githubusercontent.com/37190647/89367845-edb53f80-d68e-11ea-999c-7353b33857d3.png)
   






[GitHub] [madlib] Advitya17 commented on pull request #506: DL: Add grid/random search for model selection with `generate_model_selection_configs`

Posted by GitBox <gi...@apache.org>.
Advitya17 commented on pull request #506:
URL: https://github.com/apache/madlib/pull/506#issuecomment-668949266


   (8) I assume it's the right syntax, per the Keras documentation:
   ![image](https://user-images.githubusercontent.com/37190647/89366896-a1690000-d68c-11ea-9e82-21b98bffffb9.png)
   
   (9) Yes, I currently have SGD as the default when the user wishes to tune any optimizer params but doesn't specify an optimizer to tune.
   
   Does that help?





[GitHub] [madlib] Advitya17 commented on pull request #506: DL: Add grid/random search for model selection with `generate_model_selection_configs`

Posted by GitBox <gi...@apache.org>.
Advitya17 commented on pull request #506:
URL: https://github.com/apache/madlib/pull/506#issuecomment-669330584


   I have switched the default optimizer from SGD to RMSprop. 


----------------------------------------------------------------



[GitHub] [madlib] Advitya17 edited a comment on pull request #506: DL: Add grid/random search for model selection with `generate_model_selection_configs`

Posted by GitBox <gi...@apache.org>.
Advitya17 edited a comment on pull request #506:
URL: https://github.com/apache/madlib/pull/506#issuecomment-668955356


   ![image](https://user-images.githubusercontent.com/37190647/89367810-d70ee880-d68e-11ea-8284-cfd1b30ece2e.png)
   
   According to the Keras documentation, there does seem to be a default value for the optimizer as well (it just appears to be rmsprop rather than my manually chosen SGD).
   
   From my understanding of the documentation, yes, an optimizer is required for compiling a Keras model, but Keras seems to fall back to a default optimizer (with default params) when the user does not specify one.
   
   Please correct me if I missed anything or misunderstood.
   
   ![image](https://user-images.githubusercontent.com/37190647/89367845-edb53f80-d68e-11ea-999c-7353b33857d3.png)
   


----------------------------------------------------------------



[GitHub] [madlib] Advitya17 commented on pull request #506: DL: Add grid/random search for model selection with `generate_model_selection_configs`

Posted by GitBox <gi...@apache.org>.
Advitya17 commented on pull request #506:
URL: https://github.com/apache/madlib/pull/506#issuecomment-669530020


   @khannaekta I have made the changes specified above.


----------------------------------------------------------------



[GitHub] [madlib] orhankislal merged pull request #506: DL: Add grid/random search for model selection with `generate_model_configs`

Posted by GitBox <gi...@apache.org>.
orhankislal merged pull request #506:
URL: https://github.com/apache/madlib/pull/506


   


----------------------------------------------------------------



[GitHub] [madlib] fmcquillan99 edited a comment on pull request #506: DL: Add grid/random search for model selection with `generate_model_selection_configs`

Posted by GitBox <gi...@apache.org>.
fmcquillan99 edited a comment on pull request #506:
URL: https://github.com/apache/madlib/pull/506#issuecomment-665993038


   errors and issues
   
   (1)
   ```
   SELECT madlib.generate_model_selection_configs(
                                           'model_arch_library', -- model architecture table
                                           'mst_table',          -- model selection table output
                                            ARRAY[1,2],              -- model ids from model architecture table
                                            $$
                                            { 
                                             'lr': [1.0, 2.0, 'linear']
                                            } 
                                            $$, -- compile_param_grid 
                                            $$ 
                                            { 'batch_size': [8],
                                              'epochs': [1] 
                                            } 
                                            $$, -- fit_param_grid 
                                            
                                            'random', -- search_type (‘grid’ or ‘random’, default ‘grid’) 
                                            5, -- num_configs (number of sampled parameters. Default=10) [to limit testing] 
                                            NULL, -- random_state 
                                            NULL -- object table (Default=None)  
                                             );
   ```
   produces
   ```
   InternalError: (psycopg2.errors.InternalError_) TypeError: cannot concatenate 'str' and 'float' objects (plpython.c:5038)
   CONTEXT:  Traceback (most recent call last):
     PL/Python function "generate_model_selection_configs", line 21, in <module>
       mst_loader = madlib_keras_model_selection.MstSearch(**globals())
     PL/Python function "generate_model_selection_configs", line 42, in wrapper
     PL/Python function "generate_model_selection_configs", line 287, in __init__
     PL/Python function "generate_model_selection_configs", line 426, in find_random_combinations
     PL/Python function "generate_model_selection_configs", line 490, in generate_row_string
   PL/Python function "generate_model_selection_configs"
   
   [SQL: SELECT madlib.generate_model_selection_configs(
                                           'model_arch_library', -- model architecture table
                                           'mst_table',          -- model selection table output
                                            ARRAY[1,2],              -- model ids from model architecture table
                                            $$
                                            { 'loss': ['categorical_crossentropy'],
                                             'lr': [0.0001, 0.1, 'linear']
                                            } 
                                            $$, -- compile_param_grid 
                                            $$ 
                                            { 'batch_size': [8],
                                              'epochs': [1] 
                                            } 
                                            $$, -- fit_param_grid 
                                            
                                            'random', -- search_type (‘grid’ or ‘random’, default ‘grid’) 
                                            5, -- num_configs (number of sampled parameters. Default=10) [to limit testing] 
                                            NULL, -- random_state 
                                            NULL -- object table (Default=None)  
                                            );]
   (Background on this error at: http://sqlalche.me/e/2j85)
   ```
   
   Likewise
   ```
   DROP TABLE IF EXISTS mst_table, mst_table_summary;
   
   SELECT madlib.generate_model_selection_configs(
                                           'model_arch_library', -- model architecture table
                                           'mst_table',          -- model selection table output
                                            ARRAY[1,2],              -- model ids from model architecture table
                                            $$
                                            { 
                                             'lr': [1.0, 2.0, 'log'],
                                            } 
                                            $$, -- compile_param_grid 
                                            $$ 
                                            { 'batch_size': [8],
                                              'epochs': [1] 
                                            } 
                                            $$, -- fit_param_grid 
                                            
                                            'random', -- search_type (‘grid’ or ‘random’, default ‘grid’) 
                                            1, -- num_configs (number of sampled parameters. Default=10) [to limit testing] 
                                            NULL, -- random_state 
                                            NULL -- object table (Default=None)  
                                            ); 
                                            
   SELECT * FROM mst_table ORDER BY mst_key;
   ```
   produces
   ```
   InternalError: (psycopg2.errors.InternalError_) TypeError: cannot concatenate 'str' and 'numpy.float64' objects (plpython.c:5038)
   CONTEXT:  Traceback (most recent call last):
     PL/Python function "generate_model_selection_configs", line 21, in <module>
       mst_loader = madlib_keras_model_selection.MstSearch(**globals())
     PL/Python function "generate_model_selection_configs", line 42, in wrapper
     PL/Python function "generate_model_selection_configs", line 287, in __init__
     PL/Python function "generate_model_selection_configs", line 426, in find_random_combinations
     PL/Python function "generate_model_selection_configs", line 490, in generate_row_string
   PL/Python function "generate_model_selection_configs"
   
   [SQL: SELECT madlib.generate_model_selection_configs(
                                           'model_arch_library', -- model architecture table
                                           'mst_table',          -- model selection table output
                                            ARRAY[1,2],              -- model ids from model architecture table
                                            $$
                                            { 
                                             'lr': [1.0, 2.0, 'log'],
                                            } 
                                            $$, -- compile_param_grid 
                                            $$ 
                                            { 'batch_size': [8],
                                              'epochs': [1] 
                                            } 
                                            $$, -- fit_param_grid 
                                            
                                            'random', -- search_type (‘grid’ or ‘random’, default ‘grid’) 
                                            1, -- num_configs (number of sampled parameters. Default=10) [to limit testing] 
                                            NULL, -- random_state 
                                            NULL -- object table (Default=None)  
                                            );]
   (Background on this error at: http://sqlalche.me/e/2j85)
   ```
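   
   For reference, the traceback points at the string building in `generate_row_string`; the failure mode is the usual Python 2 one of concatenating a str with a sampled float. A minimal illustration (not MADlib's actual code; the variable names are made up) along with the obvious fix:
   
   ```python
   # Hypothetical sketch of the failure: joining a sampled float directly
   # into the compile-params string raises the TypeError seen above.
   lr = 0.9063  # sampled learning rate (float)
   try:
       row = "optimizer='Adam(lr=" + lr + ")'"  # str + float -> TypeError
   except TypeError:
       # Fix: format (or str()) the sampled value before joining.
       row = "optimizer='Adam(lr={0})'".format(lr)
   ```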
   
   (2)
   For search_type = 'grid' or 'random', the user should be able to enter part of the string, e.g., 'rand' for random or 'g' for grid.  There is a MADlib function that supports this.
   
   
   (3)
   change the name of the function from `generate_model_selection_configs`
   to `generate_model_configs`
   
   
   (4)
   remove exclamation marks and inconsistent capitalization from error messages. Suggested messages:
   
   "DL: 'num_configs' and 'random_state' must be NULL for grid search"
   
   "DL: Cannot search from a distribution with grid search"
   
   "DL: 'num_configs' cannot be NULL for random search"
   
   "DL: 'search_type' must be either 'grid' or 'random'"
   
   "DL: Please choose a valid distribution type ('linear' or 'log')"
   
   "DL: {0} should be of the format [lower_bound, upper_bound, distribution_type]"
   
   
   (5)
   In addition to `linear` sampling and `log` sampling we should add another type
   called `log_near_one`
   ```
    config_dict[cp] = 1.0 - np.power(10, np.random.uniform(np.log10(1.0 - param_values[1]), np.log10(1.0 - param_values[0])))
   ```
   This type of sampling is useful for exponentially weighted average type params like momentum, which are very sensitive to changes near 1.  It has the effect of producing more values near 1 than regular log sampling.
   
    e.g., momentum values in the range [0.9000, 0.9005] average the previous ~10 values no matter where you are in the range (no difference), but momentum values in the range [0.9990, 0.9995] average the previous ~1000 values at the left end and ~2000 values at the right end (a big difference), so you want to generate more samples nearer the right end to get better coverage.
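    
    A minimal runnable sketch of this sampler (stdlib-only for illustration; the MADlib code would use NumPy as in the snippet above, and the function name here is made up):
    
    ```python
    import math
    import random

    def sample_log_near_one(low, high):
        """Draw one value in [low, high], with density concentrated near 1.

        Equivalent to sampling (1 - x) log-uniformly on [1 - high, 1 - low].
        """
        exponent = random.uniform(math.log10(1.0 - high), math.log10(1.0 - low))
        return 1.0 - 10.0 ** exponent

    samples = [sample_log_near_one(0.999, 0.9995) for _ in range(1000)]
    in_range = all(0.999 <= s <= 0.9995 for s in samples)
    # Noticeably more than half the draws land in the upper half of the
    # range, which is the point: better coverage of momentum values near 1.
    upper_half = sum(s > 0.99925 for s in samples)
    ```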
   
   
   (6)
   ```
   DROP TABLE IF EXISTS mst_table, mst_table_summary;
   
   SELECT madlib.generate_model_selection_configs(
                                           'model_arch_library', -- model architecture table
                                           'mst_table',          -- model selection table output
                                            ARRAY[1],              -- model ids from model architecture table
                                            $$
                                            { 'loss': ['categorical_crossentropy'],
                                             'optimizer': ['Adam'],
                                             'lr': [0.9, 0.95, 'log'],
                                             'metrics': ['accuracy']
                                            } 
                                            $$, -- compile_param_grid 
                                            $$ 
                                            { 'batch_size': [8, 32, 64, 128, 256, 1024, 4096],
                                              'epochs': [1, 2, 3, 5, 10, 12] 
                                            } 
                                            $$, -- fit_param_grid 
                                            
                                            'random', -- search_type
                                            5, -- num_configs
                                            NULL, -- random_state 
                                            NULL -- object table (Default=None)  
                                            ); 
                                            
   SELECT * FROM mst_table ORDER BY mst_key;
   ```
   followed by 
   ```
   SELECT madlib.generate_model_selection_configs(
                                           'model_arch_library', -- model architecture table
                                           'mst_table',          -- model selection table output
                                            ARRAY[1],              -- model ids from model architecture table
                                            $$
                                            { 'loss': ['categorical_crossentropy'],
                                             'optimizer': ['SGD'],
                                             'metrics': ['accuracy']
                                            } 
                                            $$, -- compile_param_grid 
                                            $$ 
                                            { 'batch_size': [8, 32, 64, 128, 256, 1024, 4096],
                                              'epochs': [1, 2, 3, 5, 10, 12] 
                                            } 
                                            $$, -- fit_param_grid 
                                            
                                            'random', -- search_type
                                            5, -- num_configs
                                            NULL, -- random_state 
                                            NULL -- object table (Default=None)  
                                            ); 
                                            
   SELECT * FROM mst_table ORDER BY mst_key;
   ```
   produces
   ```
   IntegrityError: (psycopg2.errors.UniqueViolation) plpy.SPIError: duplicate key value violates unique constraint "mst_table_model_id_key"  (seg0 10.128.0.41:40000 pid=22297)
   DETAIL:  Key (model_id, compile_params, fit_params)=(1, optimizer='SGD()',metrics=['accuracy'],loss='categorical_crossentropy', epochs=12,batch_size=32) already exists.
   CONTEXT:  Traceback (most recent call last):
     PL/Python function "generate_model_selection_configs", line 22, in <module>
       mst_loader.load()
     PL/Python function "generate_model_selection_configs", line 313, in load
     PL/Python function "generate_model_selection_configs", line 566, in insert_into_mst_table
   PL/Python function "generate_model_selection_configs"
   
   [SQL: SELECT madlib.generate_model_selection_configs( 'model_arch_library', -- model architecture table
                                           'mst_table',          -- model selection table output
                                            ARRAY[1],              -- model ids from model architecture table
                                            $$
                                            { 'loss': ['categorical_crossentropy'],
                                             'optimizer': ['SGD'],
                                             'metrics': ['accuracy']
                                            } 
                                            $$, -- compile_param_grid 
                                            $$ 
                                            { 'batch_size': [8, 32, 64, 128, 256, 1024, 4096],
                                              'epochs': [1, 2, 3, 5, 10, 12] 
                                            } 
                                            $$, -- fit_param_grid 
                                            
                                            'random', -- search_type
                                            5, -- num_configs
                                            NULL, -- random_state 
                                            NULL -- object table (Default=None)  
                                            );]
   (Background on this error at: http://sqlalche.me/e/gkpj)
   ```
    But it only produced the error every second time I ran this, i.e., the first pass would work and then the second pass would throw the error.
   
   When it does pass, it produces
   ```
    mst_key | model_id |                                        compile_params                                        |        fit_params        
   ---------+----------+----------------------------------------------------------------------------------------------+--------------------------
          1 |        1 | optimizer='Adam(lr=0.9063214445649174)',metrics=['accuracy'],loss='categorical_crossentropy' | epochs=10,batch_size=256
          2 |        1 | optimizer='Adam(lr=0.9367722192055232)',metrics=['accuracy'],loss='categorical_crossentropy' | epochs=5,batch_size=256
          3 |        1 | optimizer='Adam(lr=0.9212048311857509)',metrics=['accuracy'],loss='categorical_crossentropy' | epochs=2,batch_size=32
          4 |        1 | optimizer='Adam(lr=0.9193149125403647)',metrics=['accuracy'],loss='categorical_crossentropy' | epochs=3,batch_size=256
          5 |        1 | optimizer='Adam(lr=0.9326284661833211)',metrics=['accuracy'],loss='categorical_crossentropy' | epochs=2,batch_size=256
          6 |        1 | optimizer='SGD()',metrics=['accuracy'],loss='categorical_crossentropy'                       | epochs=10,batch_size=256
          7 |        1 | optimizer='SGD()',metrics=['accuracy'],loss='categorical_crossentropy'                       | epochs=5,batch_size=8
          8 |        1 | optimizer='SGD()',metrics=['accuracy'],loss='categorical_crossentropy'                       | epochs=2,batch_size=1024
          9 |        1 | optimizer='SGD()',metrics=['accuracy'],loss='categorical_crossentropy'                       | epochs=3,batch_size=32
         10 |        1 | optimizer='SGD()',metrics=['accuracy'],loss='categorical_crossentropy'                       | epochs=12,batch_size=8
   (10 rows)
   ```
   is `optimizer='SGD()'...` correct or should it be `optimizer='SGD'...` ?
   
   
   (7)
   Not all sub-params apply to all params.  For example, for optimizer, `lr` and `decay` might only apply to certain optimizer types and not others:
   ```
   optimizer='SGD'
   optimizer='rmsprop(lr=0.0001, decay=1e-6)'
   optimizer='adam(lr=0.0001)'
   ```
   In the previous method we accounted for that by doing:
   ```
   SELECT madlib.load_model_selection_table('model_arch_library', -- model architecture table
                                            'mst_table',          -- model selection table output
                                             ARRAY[1,2],          -- model ids from model architecture table
                                             ARRAY[               -- compile params   
                                                 $$loss='categorical_crossentropy',optimizer='rmsprop(lr=0.0001, decay=1e-6)',metrics=['accuracy']$$,
                                                 $$loss='categorical_crossentropy',optimizer='rmsprop(lr=0.001, decay=1e-6)',metrics=['accuracy']$$,
                                                 $$loss='categorical_crossentropy',optimizer='adam(lr=0.0001)',metrics=['accuracy']$$,
                                                 $$loss='categorical_crossentropy',optimizer='adam(lr=0.001)',metrics=['accuracy']$$
                                             ],
                                             ARRAY[                -- fit params
                                                 $$batch_size=64,epochs=5$$, 
                                                 $$batch_size=128,epochs=5$$
                                             ]
                                            );
   ```
    but how do we do this in the new method `generate_model_configs`? You could call it multiple times and incrementally build up the `mst_table`, but when autoML methods call this function we need to support doing it in one shot.  I would suggest nested dictionaries like:
   ```
   SELECT madlib.generate_model_configs(
                                           'model_arch_library', -- model architecture table
                                           'mst_table',          -- model selection table output
                                            ARRAY[1],              -- model ids from model architecture table
                                            $$
                                            { 'loss': ['categorical_crossentropy'],
                                               'my_list': [
                                                    {'optimizer': ['SGD', 'Adagrad']},
                                                    {'optimizer': ['rmsprop'], 'lr': [0.9, 0.95, 'log'], 'decay': [1e-6, 1e-4, 'log']},
                                                    {'optimizer': ['Adam'], 'lr': [0.99, 0.995, 'log']}
                                               ],
                                              'metrics': ['accuracy']
                                            } 
                                            $$, -- compile_param_grid 
                                            $$ 
                                            { 'batch_size': [8, 32, 64, 128, 256, 1024, 4096],
                                              'epochs': [1, 2, 3, 5, 10, 12] 
                                            } 
                                            $$, -- fit_param_grid 
                                            
                                            'random', -- search_type
                                            5, -- num_configs
                                            NULL, -- random_state 
                                             NULL -- object table (Default=None)
                                             );
    ```
   So I think we should support both single dictionary and nested dictionary syntax.
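    
    As a sketch of the intended semantics (illustrative Python only, not a proposed implementation): each dict inside `my_list` defines its own sub-grid, so sub-params like `lr` only combine with the optimizers they apply to:
    
    ```python
    import itertools

    compile_grid = {
        'loss': ['categorical_crossentropy'],
        'my_list': [
            {'optimizer': ['SGD', 'Adagrad']},
            {'optimizer': ['Adam'], 'lr': [0.99, 0.995]},
        ],
        'metrics': ['accuracy'],
    }

    def expand(grid):
        # Cartesian product of the top-level params (excluding the nested list).
        keys = [k for k in grid if k != 'my_list']
        base = [dict(zip(keys, vals))
                for vals in itertools.product(*(grid[k] for k in keys))]
        combos = []
        for sub in grid['my_list']:
            # Each sub-dict expands independently, then merges with the base.
            for vals in itertools.product(*sub.values()):
                for b in base:
                    merged = dict(b)
                    merged.update(zip(sub, vals))
                    combos.append(merged)
        return combos

    configs = expand(compile_grid)  # 2 (SGD/Adagrad) + 1x2 (Adam lr) = 4 configs
    ```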
   


----------------------------------------------------------------



[GitHub] [madlib] fmcquillan99 edited a comment on pull request #506: DL: Add grid/random search for model selection with `generate_model_selection_configs`

Posted by GitBox <gi...@apache.org>.
fmcquillan99 edited a comment on pull request #506:
URL: https://github.com/apache/madlib/pull/506#issuecomment-668952722


   for (9) 
   
   Keras docs say that the optimizer and loss function are mandatory params, so I don't think we should put in SGD if the user does not specify an optimizer.  We should let it fail, no?
   
   https://keras.io/api/optimizers/
   `An optimizer is one of the two arguments required for compiling a Keras model`


----------------------------------------------------------------



[GitHub] [madlib] fmcquillan99 commented on pull request #506: DL: Add grid/random search for model selection with `generate_model_selection_configs`

Posted by GitBox <gi...@apache.org>.
fmcquillan99 commented on pull request #506:
URL: https://github.com/apache/madlib/pull/506#issuecomment-667323102


   Actually #6 is fine; I should have read this in more detail:
   ```
   IntegrityError: (psycopg2.errors.UniqueViolation) plpy.SPIError: duplicate key value violates unique constraint "mst_table_model_id_key"  (seg0 10.128.0.41:40000 pid=22297)
   DETAIL:  Key (model_id, compile_params, fit_params)=(1, optimizer='SGD()',metrics=['accuracy'],loss='categorical_crossentropy', epochs=12,batch_size=32) already exists.
   ```
   No need to do anything, I will put a better note in the user docs about this corner case.


----------------------------------------------------------------



[GitHub] [madlib] fmcquillan99 commented on pull request #506: DL: Add grid/random search for model selection with `generate_model_selection_configs`

Posted by GitBox <gi...@apache.org>.
fmcquillan99 commented on pull request #506:
URL: https://github.com/apache/madlib/pull/506#issuecomment-665993038


                                            NULL -- object table (Default=None)  
                                            );]
   (Background on this error at: http://sqlalche.me/e/2j85)
   ```
   
   (2)
    For search_type = 'grid' or 'random', the user should be able to enter part of the string, e.g., 'rand' for random or 'g' for grid.  There is a MADlib function that supports this.
   
   
   (3)
   change the name of the function from `generate_model_selection_configs`
   to `generate_model_configs`
   
   
   (4)
   remove exclamations ! from error messages and random capitalization. Suggested messages:
   
   "DL: 'num_configs' and 'random_state' must be NULL for grid search"
   
   "DL: Cannot search from a distribution with grid search"
   
   "DL: 'num_configs' cannot be NULL for random search"
   
   "DL: 'search_type' must be either 'grid' or 'random'"
   
   "DL: Please choose a valid distribution type ('linear' or 'log')"
   
   "DL: {0} should be of the format [lower_bound, upper_bound, distribution_type]"
   
   
   (5)
   In addition to `linear` sampling and `log` sampling we should add another type
   called `log_near_one`
   
    ```
    config_dict[cp] = 1.0 - np.power(
        10,
        np.random.uniform(np.log10(1.0 - param_values[1]),
                          np.log10(1.0 - param_values[0]))
    )
    ```
   
   This type of sampling is useful for exponentially weighted average type params like momentum, which are very sensitive to changes near 1.  It has the effect of producing more values near 1 than regular log sampling.
   
   e.g.
   momentum values in range [0.9000, 0.9005] average the prev 10 values no matter where you are in the range (no diff)
   but
   momentum values in range [0.9990, 0.9995] average the prev 1000 values for the left side and prev 2000 values for the right side (big diff), so you want to generate more samples nearer to the right side to get better coverage.
   
   
   (6)
   ```
   DROP TABLE IF EXISTS mst_table, mst_table_summary;
   
   SELECT madlib.generate_model_selection_configs(
                                           'model_arch_library', -- model architecture table
                                           'mst_table',          -- model selection table output
                                            ARRAY[1],              -- model ids from model architecture table
                                            $$
                                            { 'loss': ['categorical_crossentropy'],
                                             'optimizer': ['Adam'],
                                             'lr': [0.9, 0.95, 'log'],
                                             'metrics': ['accuracy']
                                            } 
                                            $$, -- compile_param_grid 
                                            $$ 
                                            { 'batch_size': [8, 32, 64, 128, 256, 1024, 4096],
                                              'epochs': [1, 2, 3, 5, 10, 12] 
                                            } 
                                            $$, -- fit_param_grid 
                                            
                                            'random', -- search_type
                                            5, -- num_configs
                                            NULL, -- random_state 
                                            NULL -- object table (Default=None)  
                                            ); 
                                            
   SELECT * FROM mst_table ORDER BY mst_key;
   ```
   followed by 
   ```
   SELECT madlib.generate_model_selection_configs(
                                           'model_arch_library', -- model architecture table
                                           'mst_table',          -- model selection table output
                                            ARRAY[1],              -- model ids from model architecture table
                                            $$
                                            { 'loss': ['categorical_crossentropy'],
                                             'optimizer': ['SGD'],
                                             'metrics': ['accuracy']
                                            } 
                                            $$, -- compile_param_grid 
                                            $$ 
                                            { 'batch_size': [8, 32, 64, 128, 256, 1024, 4096],
                                              'epochs': [1, 2, 3, 5, 10, 12] 
                                            } 
                                            $$, -- fit_param_grid 
                                            
                                            'random', -- search_type
                                            5, -- num_configs
                                            NULL, -- random_state 
                                            NULL -- object table (Default=None)  
                                            ); 
                                            
   SELECT * FROM mst_table ORDER BY mst_key;
   ```
   produces
   ```
   IntegrityError: (psycopg2.errors.UniqueViolation) plpy.SPIError: duplicate key value violates unique constraint "mst_table_model_id_key"  (seg0 10.128.0.41:40000 pid=22297)
   DETAIL:  Key (model_id, compile_params, fit_params)=(1, optimizer='SGD()',metrics=['accuracy'],loss='categorical_crossentropy', epochs=12,batch_size=32) already exists.
   CONTEXT:  Traceback (most recent call last):
     PL/Python function "generate_model_selection_configs", line 22, in <module>
       mst_loader.load()
     PL/Python function "generate_model_selection_configs", line 313, in load
     PL/Python function "generate_model_selection_configs", line 566, in insert_into_mst_table
   PL/Python function "generate_model_selection_configs"
   
   [SQL: SELECT madlib.generate_model_selection_configs( 'model_arch_library', -- model architecture table
                                           'mst_table',          -- model selection table output
                                            ARRAY[1],              -- model ids from model architecture table
                                            $$
                                            { 'loss': ['categorical_crossentropy'],
                                             'optimizer': ['SGD'],
                                             'metrics': ['accuracy']
                                            } 
                                            $$, -- compile_param_grid 
                                            $$ 
                                            { 'batch_size': [8, 32, 64, 128, 256, 1024, 4096],
                                              'epochs': [1, 2, 3, 5, 10, 12] 
                                            } 
                                            $$, -- fit_param_grid 
                                            
                                            'random', -- search_type
                                            5, -- num_configs
                                            NULL, -- random_state 
                                            NULL -- object table (Default=None)  
                                            );]
   (Background on this error at: http://sqlalche.me/e/gkpj)
   ```
    But it only produced the error every 2nd time I did this, i.e., on the 1st pass it would work, then on the 2nd pass it would throw the error.
   
   When it does pass, it produces
   ```
    mst_key | model_id |                                        compile_params                                        |        fit_params        
   ---------+----------+----------------------------------------------------------------------------------------------+--------------------------
          1 |        1 | optimizer='Adam(lr=0.9063214445649174)',metrics=['accuracy'],loss='categorical_crossentropy' | epochs=10,batch_size=256
          2 |        1 | optimizer='Adam(lr=0.9367722192055232)',metrics=['accuracy'],loss='categorical_crossentropy' | epochs=5,batch_size=256
          3 |        1 | optimizer='Adam(lr=0.9212048311857509)',metrics=['accuracy'],loss='categorical_crossentropy' | epochs=2,batch_size=32
          4 |        1 | optimizer='Adam(lr=0.9193149125403647)',metrics=['accuracy'],loss='categorical_crossentropy' | epochs=3,batch_size=256
          5 |        1 | optimizer='Adam(lr=0.9326284661833211)',metrics=['accuracy'],loss='categorical_crossentropy' | epochs=2,batch_size=256
          6 |        1 | optimizer='SGD()',metrics=['accuracy'],loss='categorical_crossentropy'                       | epochs=10,batch_size=256
          7 |        1 | optimizer='SGD()',metrics=['accuracy'],loss='categorical_crossentropy'                       | epochs=5,batch_size=8
          8 |        1 | optimizer='SGD()',metrics=['accuracy'],loss='categorical_crossentropy'                       | epochs=2,batch_size=1024
          9 |        1 | optimizer='SGD()',metrics=['accuracy'],loss='categorical_crossentropy'                       | epochs=3,batch_size=32
         10 |        1 | optimizer='SGD()',metrics=['accuracy'],loss='categorical_crossentropy'                       | epochs=12,batch_size=8
   (10 rows)
   ```
   is `optimizer='SGD()'...` correct or should it be `optimizer='SGD'...` ?
   
   
   
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [madlib] Advitya17 edited a comment on pull request #506: DL: Add grid/random search for model selection with `generate_model_selection_configs`

Posted by GitBox <gi...@apache.org>.
Advitya17 edited a comment on pull request #506:
URL: https://github.com/apache/madlib/pull/506#issuecomment-668949266


   (8) I assume it's the right syntax, and the optimizer would just take the default param values (per the Keras documentation shown below)
   ![image](https://user-images.githubusercontent.com/37190647/89366896-a1690000-d68c-11ea-9e82-21b98bffffb9.png)
   
   (9) Yes, I currently have SGD as the default when the user wishes to tune any optimizer params but doesn't specify an optimizer to tune.
   
   Does that help?




[GitHub] [madlib] fmcquillan99 edited a comment on pull request #506: DL: Add grid/random search for model selection with `generate_model_selection_configs`

Posted by GitBox <gi...@apache.org>.
fmcquillan99 edited a comment on pull request #506:
URL: https://github.com/apache/madlib/pull/506#issuecomment-665993038


   errors and issues
   
   (1)
   ```
   SELECT madlib.generate_model_selection_configs(
                                           'model_arch_library', -- model architecture table
                                           'mst_table',          -- model selection table output
                                            ARRAY[1,2],              -- model ids from model architecture table
                                            $$
                                            { 
                                             'lr': [1.0, 2.0, 'linear']
                                            } 
                                            $$, -- compile_param_grid 
                                            $$ 
                                            { 'batch_size': [8],
                                              'epochs': [1] 
                                            } 
                                            $$, -- fit_param_grid 
                                            
                                            'random', -- search_type (‘grid’ or ‘random’, default ‘grid’) 
                                            5, -- num_configs (number of sampled parameters. Default=10) [to limit testing] 
                                            NULL, -- random_state 
                                            NULL -- object table (Default=None)  
                                             );
   ```
   produces
   ```
   InternalError: (psycopg2.errors.InternalError_) TypeError: cannot concatenate 'str' and 'float' objects (plpython.c:5038)
   CONTEXT:  Traceback (most recent call last):
     PL/Python function "generate_model_selection_configs", line 21, in <module>
       mst_loader = madlib_keras_model_selection.MstSearch(**globals())
     PL/Python function "generate_model_selection_configs", line 42, in wrapper
     PL/Python function "generate_model_selection_configs", line 287, in __init__
     PL/Python function "generate_model_selection_configs", line 426, in find_random_combinations
     PL/Python function "generate_model_selection_configs", line 490, in generate_row_string
   PL/Python function "generate_model_selection_configs"
   
   [SQL: SELECT madlib.generate_model_selection_configs(
                                           'model_arch_library', -- model architecture table
                                           'mst_table',          -- model selection table output
                                            ARRAY[1,2],              -- model ids from model architecture table
                                            $$
                                            { 'loss': ['categorical_crossentropy'],
                                             'lr': [0.0001, 0.1, 'linear']
                                            } 
                                            $$, -- compile_param_grid 
                                            $$ 
                                            { 'batch_size': [8],
                                              'epochs': [1] 
                                            } 
                                            $$, -- fit_param_grid 
                                            
                                            'random', -- search_type (‘grid’ or ‘random’, default ‘grid’) 
                                            5, -- num_configs (number of sampled parameters. Default=10) [to limit testing] 
                                            NULL, -- random_state 
                                            NULL -- object table (Default=None)  
                                            );]
   (Background on this error at: http://sqlalche.me/e/2j85)
   ```
   
   Likewise
   ```
   DROP TABLE IF EXISTS mst_table, mst_table_summary;
   
   SELECT madlib.generate_model_selection_configs(
                                           'model_arch_library', -- model architecture table
                                           'mst_table',          -- model selection table output
                                            ARRAY[1,2],              -- model ids from model architecture table
                                            $$
                                            { 
                                             'lr': [1.0, 2.0, 'log'],
                                            } 
                                            $$, -- compile_param_grid 
                                            $$ 
                                            { 'batch_size': [8],
                                              'epochs': [1] 
                                            } 
                                            $$, -- fit_param_grid 
                                            
                                            'random', -- search_type (‘grid’ or ‘random’, default ‘grid’) 
                                            1, -- num_configs (number of sampled parameters. Default=10) [to limit testing] 
                                            NULL, -- random_state 
                                            NULL -- object table (Default=None)  
                                            ); 
                                            
   SELECT * FROM mst_table ORDER BY mst_key;
   ```
   produces
   ```
   InternalError: (psycopg2.errors.InternalError_) TypeError: cannot concatenate 'str' and 'numpy.float64' objects (plpython.c:5038)
   CONTEXT:  Traceback (most recent call last):
     PL/Python function "generate_model_selection_configs", line 21, in <module>
       mst_loader = madlib_keras_model_selection.MstSearch(**globals())
     PL/Python function "generate_model_selection_configs", line 42, in wrapper
     PL/Python function "generate_model_selection_configs", line 287, in __init__
     PL/Python function "generate_model_selection_configs", line 426, in find_random_combinations
     PL/Python function "generate_model_selection_configs", line 490, in generate_row_string
   PL/Python function "generate_model_selection_configs"
   
   [SQL: SELECT madlib.generate_model_selection_configs(
                                           'model_arch_library', -- model architecture table
                                           'mst_table',          -- model selection table output
                                            ARRAY[1,2],              -- model ids from model architecture table
                                            $$
                                            { 
                                             'lr': [1.0, 2.0, 'log'],
                                            } 
                                            $$, -- compile_param_grid 
                                            $$ 
                                            { 'batch_size': [8],
                                              'epochs': [1] 
                                            } 
                                            $$, -- fit_param_grid 
                                            
                                            'random', -- search_type (‘grid’ or ‘random’, default ‘grid’) 
                                            1, -- num_configs (number of sampled parameters. Default=10) [to limit testing] 
                                            NULL, -- random_state 
                                            NULL -- object table (Default=None)  
                                            );]
   (Background on this error at: http://sqlalche.me/e/2j85)
   ```
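    Both tracebacks bottom out in `generate_row_string`, which points at the sampled numeric value being concatenated into the compile-params string without a cast. A minimal sketch of that failure mode and the obvious fix (the function body here is illustrative, not the actual MADlib code):
    ```python
    def generate_row_string(config_dict):
        # Joining str and float with "+" raises the TypeError seen above,
        # so cast every sampled value to str before concatenation
        # (str() also handles numpy.float64).
        parts = []
        for key in sorted(config_dict):
            parts.append(key + "=" + str(config_dict[key]))
        return ",".join(parts)

    row = generate_row_string({"lr": 1.2345, "epochs": 1})
    ```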
   
   (2)
    For search_type = 'grid' or 'random', the user should be able to enter part of the string, e.g., 'rand' for random or 'g' for grid.  There is a MADlib function that supports this.
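    I don't have the name of that MADlib helper in front of me, but the desired behavior is just unambiguous-prefix matching; a hedged sketch of it (names are illustrative):
    ```python
    def match_prefix(user_input, valid_values):
        """Return the unique valid value that user_input is a prefix of;
        raise if the prefix matches zero or several values."""
        candidates = [v for v in valid_values if v.startswith(user_input.lower())]
        if len(candidates) != 1:
            raise ValueError("DL: 'search_type' must be either 'grid' or 'random'")
        return candidates[0]

    result = match_prefix('rand', ['grid', 'random'])
    ```
    So 'rand' resolves to 'random', 'g' to 'grid', and anything ambiguous or unknown raises the suggested error message.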
   
   
   (3)
   change the name of the function from `generate_model_selection_configs`
   to `generate_model_configs`
   
   
   (4)
   remove exclamations ! from error messages and random capitalization. Suggested messages:
   
   "DL: 'num_configs' and 'random_state' must be NULL for grid search"
   
   "DL: Cannot search from a distribution with grid search"
   
   "DL: 'num_configs' cannot be NULL for random search"
   
   "DL: 'search_type' must be either 'grid' or 'random'"
   
   "DL: Please choose a valid distribution type ('linear' or 'log')"
   
   "DL: {0} should be of the format [lower_bound, upper_bound, distribution_type]"
   
   
   (5)
   In addition to `linear` sampling and `log` sampling we should add another type
   called `log_near_one`
   ```
    config_dict[cp] = 1.0 - np.power(10, np.random.uniform(np.log10(1.0 - param_values[1]), np.log10(1.0 - param_values[0])))
   ```
   This type of sampling is useful for exponentially weighted average type params like momentum, which are very sensitive to changes near 1.  It has the effect of producing more values near 1 than regular log sampling.
   
   e.g.
   momentum values in range [0.9000, 0.9005] average the prev 10 values no matter where you are in the range (no diff)
   but
   momentum values in range [0.9990, 0.9995] average the prev 1000 values for the left side and prev 2000 values for the right side (big diff), so you want to generate more samples nearer to the right side to get better coverage.
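    To make the effect concrete, here is a standalone sketch of the proposed sampler using plain `random`/`math` (function names are mine, not the MADlib code). Over [0.9, 0.999], roughly half the draws land above 0.99, whereas linear sampling would put only about a tenth of them there:
    ```python
    import math
    import random

    def sample_log(lo, hi, rng):
        # ordinary log-uniform sampling over [lo, hi]
        return 10 ** rng.uniform(math.log10(lo), math.log10(hi))

    def sample_log_near_one(lo, hi, rng):
        # log-uniform in (1 - x): samples concentrate near the upper end
        return 1.0 - 10 ** rng.uniform(math.log10(1.0 - hi), math.log10(1.0 - lo))

    rng = random.Random(0)
    draws = [sample_log_near_one(0.9, 0.999, rng) for _ in range(10000)]
    frac_near_one = sum(d > 0.99 for d in draws) / float(len(draws))
    ```
    Plain log sampling would not help here: with bounds this far from zero it behaves almost linearly.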
   
   
   (6)
   ```
   DROP TABLE IF EXISTS mst_table, mst_table_summary;
   
   SELECT madlib.generate_model_selection_configs(
                                           'model_arch_library', -- model architecture table
                                           'mst_table',          -- model selection table output
                                            ARRAY[1],              -- model ids from model architecture table
                                            $$
                                            { 'loss': ['categorical_crossentropy'],
                                             'optimizer': ['Adam'],
                                             'lr': [0.9, 0.95, 'log'],
                                             'metrics': ['accuracy']
                                            } 
                                            $$, -- compile_param_grid 
                                            $$ 
                                            { 'batch_size': [8, 32, 64, 128, 256, 1024, 4096],
                                              'epochs': [1, 2, 3, 5, 10, 12] 
                                            } 
                                            $$, -- fit_param_grid 
                                            
                                            'random', -- search_type
                                            5, -- num_configs
                                            NULL, -- random_state 
                                            NULL -- object table (Default=None)  
                                            ); 
                                            
   SELECT * FROM mst_table ORDER BY mst_key;
   ```
   followed by 
   ```
   SELECT madlib.generate_model_selection_configs(
                                           'model_arch_library', -- model architecture table
                                           'mst_table',          -- model selection table output
                                            ARRAY[1],              -- model ids from model architecture table
                                            $$
                                            { 'loss': ['categorical_crossentropy'],
                                             'optimizer': ['SGD'],
                                             'metrics': ['accuracy']
                                            } 
                                            $$, -- compile_param_grid 
                                            $$ 
                                            { 'batch_size': [8, 32, 64, 128, 256, 1024, 4096],
                                              'epochs': [1, 2, 3, 5, 10, 12] 
                                            } 
                                            $$, -- fit_param_grid 
                                            
                                            'random', -- search_type
                                            5, -- num_configs
                                            NULL, -- random_state 
                                            NULL -- object table (Default=None)  
                                            ); 
                                            
   SELECT * FROM mst_table ORDER BY mst_key;
   ```
   produces
   ```
   IntegrityError: (psycopg2.errors.UniqueViolation) plpy.SPIError: duplicate key value violates unique constraint "mst_table_model_id_key"  (seg0 10.128.0.41:40000 pid=22297)
   DETAIL:  Key (model_id, compile_params, fit_params)=(1, optimizer='SGD()',metrics=['accuracy'],loss='categorical_crossentropy', epochs=12,batch_size=32) already exists.
   CONTEXT:  Traceback (most recent call last):
     PL/Python function "generate_model_selection_configs", line 22, in <module>
       mst_loader.load()
     PL/Python function "generate_model_selection_configs", line 313, in load
     PL/Python function "generate_model_selection_configs", line 566, in insert_into_mst_table
   PL/Python function "generate_model_selection_configs"
   
   [SQL: SELECT madlib.generate_model_selection_configs( 'model_arch_library', -- model architecture table
                                           'mst_table',          -- model selection table output
                                            ARRAY[1],              -- model ids from model architecture table
                                            $$
                                            { 'loss': ['categorical_crossentropy'],
                                             'optimizer': ['SGD'],
                                             'metrics': ['accuracy']
                                            } 
                                            $$, -- compile_param_grid 
                                            $$ 
                                            { 'batch_size': [8, 32, 64, 128, 256, 1024, 4096],
                                              'epochs': [1, 2, 3, 5, 10, 12] 
                                            } 
                                            $$, -- fit_param_grid 
                                            
                                            'random', -- search_type
                                            5, -- num_configs
                                            NULL, -- random_state 
                                            NULL -- object table (Default=None)  
                                            );]
   (Background on this error at: http://sqlalche.me/e/gkpj)
   ```
    But it only produced the error every 2nd time I did this, i.e., on the 1st pass it would work, then on the 2nd pass it would throw the error.
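    The intermittent failure is consistent with random search drawing with replacement from a small discrete grid: the same (model_id, compile_params, fit_params) tuple can be sampled twice, but only sometimes is. One way to screen out repeats before the INSERT that trips the unique constraint, sketched under assumed names (this is not the actual loader code):
    ```python
    import random

    def sample_unique_configs(grid, num_configs, seed=None, max_tries=1000):
        """Rejection-sample distinct (batch_size, epochs) pairs so that
        duplicates never reach the mst table insert."""
        rng = random.Random(seed)
        seen = set()
        tries = 0
        while len(seen) < num_configs and tries < max_tries:
            tries += 1
            seen.add((rng.choice(grid['batch_size']), rng.choice(grid['epochs'])))
        return sorted(seen)

    grid = {'batch_size': [8, 32, 64], 'epochs': [1, 2]}
    picked = sample_unique_configs(grid, 5, seed=42)
    ```
    The same membership check against rows already in the table would cover the two-call case shown above.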
   
   When it does pass, it produces
   ```
    mst_key | model_id |                                        compile_params                                        |        fit_params        
   ---------+----------+----------------------------------------------------------------------------------------------+--------------------------
          1 |        1 | optimizer='Adam(lr=0.9063214445649174)',metrics=['accuracy'],loss='categorical_crossentropy' | epochs=10,batch_size=256
          2 |        1 | optimizer='Adam(lr=0.9367722192055232)',metrics=['accuracy'],loss='categorical_crossentropy' | epochs=5,batch_size=256
          3 |        1 | optimizer='Adam(lr=0.9212048311857509)',metrics=['accuracy'],loss='categorical_crossentropy' | epochs=2,batch_size=32
          4 |        1 | optimizer='Adam(lr=0.9193149125403647)',metrics=['accuracy'],loss='categorical_crossentropy' | epochs=3,batch_size=256
          5 |        1 | optimizer='Adam(lr=0.9326284661833211)',metrics=['accuracy'],loss='categorical_crossentropy' | epochs=2,batch_size=256
          6 |        1 | optimizer='SGD()',metrics=['accuracy'],loss='categorical_crossentropy'                       | epochs=10,batch_size=256
          7 |        1 | optimizer='SGD()',metrics=['accuracy'],loss='categorical_crossentropy'                       | epochs=5,batch_size=8
          8 |        1 | optimizer='SGD()',metrics=['accuracy'],loss='categorical_crossentropy'                       | epochs=2,batch_size=1024
          9 |        1 | optimizer='SGD()',metrics=['accuracy'],loss='categorical_crossentropy'                       | epochs=3,batch_size=32
         10 |        1 | optimizer='SGD()',metrics=['accuracy'],loss='categorical_crossentropy'                       | epochs=12,batch_size=8
   (10 rows)
   ```
   is `optimizer='SGD()'...` correct or should it be `optimizer='SGD'...` ?
   
   
   (7)
   Not all sub-params apply to all params.  For example, for optimizer, `lr` and `decay` might only apply to certain optimizer types and not others:
   ```
   optimizer='SGD'
   optimizer='rmsprop(lr=0.0001, decay=1e-6)'
   optimizer='adam(lr=0.0001)'
   ```
   In the previous method we accounted for that by doing:
   ```
   SELECT madlib.load_model_selection_table('model_arch_library', -- model architecture table
                                            'mst_table',          -- model selection table output
                                             ARRAY[1,2],          -- model ids from model architecture table
                                             ARRAY[               -- compile params   
                                                 $$loss='categorical_crossentropy',optimizer='rmsprop(lr=0.0001, decay=1e-6)',metrics=['accuracy']$$,
                                                 $$loss='categorical_crossentropy',optimizer='rmsprop(lr=0.001, decay=1e-6)',metrics=['accuracy']$$,
                                                 $$loss='categorical_crossentropy',optimizer='adam(lr=0.0001)',metrics=['accuracy']$$,
                                                 $$loss='categorical_crossentropy',optimizer='adam(lr=0.001)',metrics=['accuracy']$$
                                             ],
                                             ARRAY[                -- fit params
                                                 $$batch_size=64,epochs=5$$, 
                                                 $$batch_size=128,epochs=5$$
                                             ]
                                            );
   ```
    but how do we do this in the new method `generate_model_configs`? You could call it multiple times and incrementally build up the `mst_table`, but when autoML methods call this function we need to support it in a one-shot manner.  Perhaps we need to support nested dictionaries?  Something like this:
   ```
   SELECT madlib.generate_model_selection_configs(
                                           'model_arch_library', -- model architecture table
                                           'mst_table',          -- model selection table output
                                            ARRAY[1],              -- model ids from model architecture table
                                            $$
                                             { 'loss': ['categorical_crossentropy'],
                                               'optimizer_params_list': {
                                                   'opt1': {'optimizer': ['SGD']},
                                                   'opt2': {'optimizer': ['rmsprop'], 'lr': [0.9, 0.95, 'log'], 'decay': [1e-6, 1e-4, 'log']},
                                                   'opt3': {'optimizer': ['Adam'], 'lr': [0.9, 0.95, 'log']}
                                               },
                                               'metrics': ['accuracy']
                                             }
                                            $$, -- compile_param_grid 
                                            $$ 
                                            { 'batch_size': [8, 32, 64, 128, 256, 1024, 4096],
                                              'epochs': [1, 2, 3, 5, 10, 12] 
                                            } 
                                            $$, -- fit_param_grid 
                                            
                                            'random', -- search_type
                                            5, -- num_configs
                                            NULL, -- random_state 
                                             NULL -- object table (Default=None)
                                             );
    ```
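One way such nested groups could expand for grid search is sketched below, with a hypothetical layout in which each group's params only combine within that group (the group labels are illustrative, and the `[low, high, 'log']` distribution form would apply to random search only):

```python
from itertools import product

def expand_optimizer_groups(groups):
    """Expand per-optimizer param grids into flat config dicts.

    Each group is expanded independently, so params like 'lr' and
    'decay' only combine with the optimizers they belong to.
    """
    configs = []
    for group in groups.values():
        keys, values = zip(*group.items())
        for combo in product(*values):
            configs.append(dict(zip(keys, combo)))
    return configs

grid = {
    'opt1': {'optimizer': ['SGD']},
    'opt2': {'optimizer': ['rmsprop'], 'lr': [0.0001, 0.001], 'decay': [1e-6]},
    'opt3': {'optimizer': ['Adam'], 'lr': [0.0001, 0.001]},
}
# 1 + 2 + 2 = 5 configs; the SGD config carries no lr/decay keys at all.
```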
   
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [madlib] Advitya17 edited a comment on pull request #506: DL: Add grid/random search for model selection with `generate_model_selection_configs`

Posted by GitBox <gi...@apache.org>.
Advitya17 edited a comment on pull request #506:
URL: https://github.com/apache/madlib/pull/506#issuecomment-668949266


   (8) I assume it's the right syntax, and the optimizer would just take the default param values (as per the keras documentation)
   ![image](https://user-images.githubusercontent.com/37190647/89366896-a1690000-d68c-11ea-9e82-21b98bffffb9.png)
   
   (9) Yes, I currently have SGD as the default when the user wishes to tune any optimizer params but doesn't specify an optimizer to tune.
   
   Does that help?





[GitHub] [madlib] fmcquillan99 commented on pull request #506: DL: Add grid/random search for model selection with `generate_model_selection_configs`

Posted by GitBox <gi...@apache.org>.
fmcquillan99 commented on pull request #506:
URL: https://github.com/apache/madlib/pull/506#issuecomment-668952722


   for (9) 
   
   Keras docs say that optimizer and loss function are mandatory params, so I don't think we should put SGD if user does not specify an optimizer.  We should let it fail, no?
   
   https://keras.io/api/optimizers/





[GitHub] [madlib] Advitya17 edited a comment on pull request #506: DL: Add grid/random search for model selection with `generate_model_selection_configs`

Posted by GitBox <gi...@apache.org>.
Advitya17 edited a comment on pull request #506:
URL: https://github.com/apache/madlib/pull/506#issuecomment-668955356


   ![image](https://user-images.githubusercontent.com/37190647/89367810-d70ee880-d68e-11ea-8284-cfd1b30ece2e.png)
   
   According to the keras documentation, it seems there is a default value for optimizer as well (just that it seems to be rmsprop instead of my manually chosen SGD). 
   
   From my understanding of the documentation, yes, an optimizer is required for compiling a keras model, but keras already falls back to a default optimizer (with default params) when the user does not specify one.
   
   Please correct me if I missed anything or didn't understand it right.
   
   ![image](https://user-images.githubusercontent.com/37190647/89367845-edb53f80-d68e-11ea-999c-7353b33857d3.png)
   
   Compile method reference - https://keras.io/api/models/model_training_apis/#compile-method








[GitHub] [madlib] fmcquillan99 commented on pull request #506: DL: Add grid/random search for model selection with `generate_model_selection_configs`

Posted by GitBox <gi...@apache.org>.
fmcquillan99 commented on pull request #506:
URL: https://github.com/apache/madlib/pull/506#issuecomment-670230673


   LGTM
   





[GitHub] [madlib] fmcquillan99 edited a comment on pull request #506: DL: Add grid/random search for model selection with `generate_model_selection_configs`

Posted by GitBox <gi...@apache.org>.
fmcquillan99 edited a comment on pull request #506:
URL: https://github.com/apache/madlib/pull/506#issuecomment-668947847


   From my previous comments:
   
   1,2,3,4,5,7 are fixed
   6 is OK as is, no fix needed
   
   Additional questions:
   (8)
   is this the right syntax for an optimizer with no params? i.e., `Adagrad()`
   ```
   optimizer='Adagrad()',metrics=['accuracy'],loss='categorical_crossentropy'
   ```
   
   (9)
   if I did not put an optimizer:
   ```
   DROP TABLE IF EXISTS mst_table, mst_table_summary;
   SELECT madlib.generate_model_configs(
                                           'model_arch_library', -- model architecture table
                                           'mst_table',          -- model selection table output
                                            ARRAY[1,2],              -- model ids from model architecture table
                                            $$
                                               {'loss': ['categorical_crossentropy'], 
                                                'optimizer_params_list': [ {'lr': [0.0001, 0.1, 'log']} ], 
                                                'metrics': ['accuracy']}
                                            $$, -- compile_param_grid 
                                            $$ 
                                            { 'batch_size': [8],
                                              'epochs': [1] 
                                            } 
                                            $$, -- fit_param_grid 
                                            
                                            'random', -- search_type (‘grid’ or ‘random’, default ‘grid’) 
                                            2, -- num_configs (number of sampled parameters. Default=10) [to limit testing] 
                                            NULL, -- random_state 
                                            NULL -- object table (Default=None)  
                                            );
   ```
   then `SGD` shows up:  
   ```
    mst_key | model_id |                                        compile_params                                         |      fit_params       
   ---------+----------+-----------------------------------------------------------------------------------------------+-----------------------
          1 |        1 | optimizer='SGD(lr=0.002963575680717671)',metrics=['accuracy'],loss='categorical_crossentropy' | epochs=1,batch_size=8
          2 |        2 | optimizer='SGD(lr=0.027802557490831045)',metrics=['accuracy'],loss='categorical_crossentropy' | epochs=1,batch_size=8
   (2 rows)
   ```
   Do we do that or does Keras do that?
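   For reference, the `lr` values in the output above come from the PR's log-scale sampling, which draws uniformly in log10 space between the two bounds. A minimal sketch of that behavior (using a local RandomState rather than the module's global seeding):

```python
import numpy as np

def sample_log_uniform(low, high, random_state=None):
    """Draw one value log-uniformly from [low, high], mirroring the
    'log' branch of generate_param_config in this PR."""
    rng = np.random.RandomState(random_state)
    return float(np.power(10, rng.uniform(np.log10(low), np.log10(high))))

lr = sample_log_uniform(0.0001, 0.1, random_state=0)
```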





[GitHub] [madlib] orhankislal commented on a change in pull request #506: DL: Add grid/random search for model selection with `generate_model_selection_configs`

Posted by GitBox <gi...@apache.org>.
orhankislal commented on a change in pull request #506:
URL: https://github.com/apache/madlib/pull/506#discussion_r457713770



##########
File path: src/ports/postgres/modules/deep_learning/madlib_keras_model_selection.py_in
##########
@@ -203,3 +212,365 @@ class MstLoader():
                                   object_table_name=ModelSelectionSchema.OBJECT_TABLE,
                                   **locals())
         plpy.execute(insert_summary_query)
+
+@MinWarning("warning")
+class MstSearch():
+    """
+    The utility class for generating model selection configs and loading into a MST table with model parameters.
+
+    Currently takes string representations of python dictionaries for compile and fit params.
+    Generates configs with a chosen search algorithm
+
+    Attributes:
+        model_arch_table (str): The name of model architecture table.
+        model_selection_table (str): The name of the output mst table.
+        model_id_list (list): The input list of model id choices.
+        compile_params_grid (string repr of python dict): The input of compile params choices.
+        fit_params_grid (string repr of python dict): The input of fit params choices.
+        search_type (str, default 'grid'): Hyperparameter search strategy, 'grid' or 'random'.
+
+        Only for 'random' search type (defaults None):
+            num_configs (int): Number of configs to generate.
+            random_state (int): Seed for result reproducibility.
+
+        object_table (str, default None): The name of the object table, for custom (metric) functions.
+
+    """
+
+    def __init__(self,
+                 model_arch_table,
+                 model_selection_table,
+                 model_id_list,
+                 compile_params_grid,
+                 fit_params_grid,
+                 search_type='grid',
+                 num_configs=None,
+                 random_state=None,
+                 object_table=None,
+                 **kwargs):
+
+        self.model_arch_table = model_arch_table
+        self.model_selection_table = model_selection_table
+        self.model_selection_summary_table = add_postfix(
+            model_selection_table, "_summary")
+        self.model_id_list = sorted(list(set(model_id_list)))
+
+        MstLoaderInputValidator(
+            model_arch_table=self.model_arch_table,
+            model_selection_table=self.model_selection_table,
+            model_selection_summary_table=self.model_selection_summary_table,
+            model_id_list=self.model_id_list,
+            compile_params_list=compile_params_grid,
+            fit_params_list=fit_params_grid,
+            object_table=object_table,
+            module_name='generate_model_selection_configs'
+        )
+
+        self.search_type = search_type
+        self.num_configs = num_configs
+        self.random_state = random_state
+        self.object_table = object_table
+
+        compile_params_grid = compile_params_grid.replace('\n', '').replace(' ', '')
+        fit_params_grid = fit_params_grid.replace('\n', '').replace(' ', '')
+        self.validate_inputs(compile_params_grid, fit_params_grid)
+
+        # extracting python dict
+        self.compile_params_dict = literal_eval(compile_params_grid)
+        self.fit_params_dict = literal_eval(fit_params_grid)
+
+        self.msts = []
+
+        if self.search_type == 'grid':
+            self.find_grid_combinations()
+        elif self.search_type == 'random': # else should also suffice as random search is established.
+            self.find_random_combinations()
+
+        compile_params_lst, fit_params_lst = [], []
+        for i in self.msts:
+            compile_params_lst.append(i[ModelSelectionSchema.COMPILE_PARAMS])
+            fit_params_lst.append(i[ModelSelectionSchema.FIT_PARAMS])
+        self._validate_params_and_object_table(compile_params_lst, fit_params_lst)
+
+    def load(self):
+        """The entry point for loading the model selection table.
+        """
+        # All of the side effects happen in this function.
+        if not table_exists(self.model_selection_table):
+            self.create_mst_table()
+        self.create_mst_summary_table()
+        self.insert_into_mst_table()
+
+    def validate_inputs(self, compile_params_grid, fit_params_grid):
+        """
+        Ensures validity of inputs related to grid and random search.
+
+        :param compile_params_grid: The input string repr of compile params choices.
+        :param fit_params_grid: The input string repr of fit params choices.
+        """
+
+        # TODO: add additional cases for validating params (and test it)
+
+        if self.search_type == 'grid':
+            _assert(self.num_configs is None and self.random_state is None,
+                    "'num_configs' and 'random_state' have to be NULL for Grid Search")
+            for distribution_type in ['linear', 'log']:
+                _assert(distribution_type not in compile_params_grid and distribution_type not in fit_params_grid,
+                        "Cannot search from a distribution with Grid Search!")
+        elif self.search_type == 'random':
+            _assert(self.num_configs is not None, "'num_configs' cannot be NULL for Random Search")
+        else:
+            plpy.error("'search_type' has to be either 'grid' or 'random' !")
+
+    def _validate_params_and_object_table(self, compile_params_lst, fit_params_lst):
+        if not fit_params_lst:
+            plpy.error("fit_params_list cannot be NULL")
+        for fit_params in fit_params_lst:
+            try:
+                res = parse_and_validate_fit_params(fit_params)
+            except Exception as e:
+                plpy.error(
+                    """Fit param check failed for: {0} \n
+                    {1}
+                    """.format(fit_params, str(e)))
+        if not compile_params_lst:
+            plpy.error("compile_params_list cannot be NULL")
+        custom_fn_name = []
+        ## Initialize builtin loss/metrics functions
+        builtin_losses = dir(losses)
+        builtin_metrics = dir(metrics)
+        # Default metrics, since it is not part of the builtin metrics list
+        builtin_metrics.append('accuracy')
+        if self.object_table is not None:
+            res = plpy.execute("SELECT {0} from {1}".format(CustomFunctionSchema.FN_NAME,
+                                                            self.object_table))
+            for r in res:
+                custom_fn_name.append(r[CustomFunctionSchema.FN_NAME])
+        for compile_params in compile_params_lst:
+            try:
+                _, _, res = parse_and_validate_compile_params(compile_params)
+                # Validating if loss/metrics function called in compile_params
+                # is either defined in object table or is a built_in keras
+                # loss/metrics function
+                error_suffix = "but input object table missing!"
+                if self.object_table is not None:
+                    error_suffix = "is not defined in object table '{0}'!".format(self.object_table)
+
+                _assert(res['loss'] in custom_fn_name or res['loss'] in builtin_losses,
+                        "custom function '{0}' used in compile params " \
+                        "{1}".format(res['loss'], error_suffix))
+                if 'metrics' in res:
+                    _assert((len(set(res['metrics']).intersection(custom_fn_name)) > 0
+                             or len(set(res['metrics']).intersection(builtin_metrics)) > 0),
+                            "custom function '{0}' used in compile params " \
+                            "{1}".format(res['metrics'], error_suffix))
+
+            except Exception as e:
+                plpy.error(
+                    """Compile param check failed for: {0} \n
+                    {1}
+                    """.format(compile_params, str(e)))
+
+    def find_grid_combinations(self):
+        """
+        Finds combinations using grid search.
+        """
+        combined_dict = dict(self.compile_params_dict, **self.fit_params_dict)
+        combined_dict[ModelSelectionSchema.MODEL_ID] = self.model_id_list
+        keys, values = zip(*combined_dict.items())
+        all_configs_params = [dict(zip(keys, v)) for v in itertools_product(*values)]
+
+        # to separate the compile and fit configs
+        for config in all_configs_params:
+            combination = {}
+            compile_configs, fit_configs = {}, {}
+            for k in config:
+                if k == ModelSelectionSchema.MODEL_ID:
+                    combination[ModelSelectionSchema.MODEL_ID] = config[k]
+                elif k in self.compile_params_dict:
+                    compile_configs[k] = config[k]
+                elif k in self.fit_params_dict:
+                    fit_configs[k] = config[k]
+                else:
+                    plpy.error("{0} is an unidentified key".format(k))
+
+            combination[ModelSelectionSchema.COMPILE_PARAMS] = self.generate_row_string(compile_configs)
+            combination[ModelSelectionSchema.FIT_PARAMS] = self.generate_row_string(fit_configs)
+            self.msts.append(combination)
+
+    def find_random_combinations(self):
+        """
+        Finds combinations using random search.
+        """
+        if self.random_state:
+            seed_changes = 0
+        else:
+            seed_changes = None
+
+        for _ in range(self.num_configs):
+            combination = {}
+            if self.random_state:
+                np.random.seed(self.random_state+seed_changes)
+                seed_changes += 1
+            combination[ModelSelectionSchema.MODEL_ID] = np.random.choice(self.model_id_list)
+            compile_d = {}
+            compile_d, seed_changes = self.generate_param_config(self.compile_params_dict, compile_d, seed_changes)
+            combination[ModelSelectionSchema.COMPILE_PARAMS] = self.generate_row_string(compile_d)
+            fit_d = {}
+            fit_d, seed_changes = self.generate_param_config(self.fit_params_dict, fit_d, seed_changes)
+            combination[ModelSelectionSchema.FIT_PARAMS] = self.generate_row_string(fit_d)
+            self.msts.append(combination)
+
+    def generate_param_config(self, params_dict, config_dict, seed_changes):
+        """
+        Generating a parameter configuration for random search.
+        :param params_dict: Dictionary of params choices.
+        :param config_dict: Dictionary to store param config.
+        :param seed_changes: Changes in seed for random sampling + reproducibility.
+        :return: config_dict, seed_changes.
+        """
+        for cp in params_dict:
+            if self.random_state:
+                np.random.seed(self.random_state+seed_changes)
+                seed_changes += 1
+
+            param_values = params_dict[cp]
+
+            # sampling from a distribution
+            if param_values[-1] in ['linear', 'log']:
+                _assert(len(param_values) == 3,
+                        "{0} should have exactly 3 elements if picking from a distribution".format(cp))
+                _assert(param_values[1] > param_values[0],
+                        "{0} should be of the format [lower_bound, upper_bound, distribution_type]".format(cp))
+                if param_values[-1] == 'linear':
+                    config_dict[cp] = np.random.uniform(param_values[0], param_values[1])
+                elif param_values[-1] == 'log':
+                    config_dict[cp] = np.power(10, np.random.uniform(np.log10(param_values[0]),
+                                                                     np.log10(param_values[1])))
+                else:
+                    plpy.error("Choose a valid distribution type! ('linear' or 'log')")
+            # random sampling
+            else:
+                config_dict[cp] = np.random.choice(params_dict[cp])
+
+        return config_dict, seed_changes
+
+    def generate_row_string(self, configs_dict):
+        """
+        Generate row strings for MST table.
+        :param configs_dict: Dictionary of params config.
+        :return: string to insert as a row in MST table.
+        """
+        result_row_string = ""
+
+        if 'optimizer' in configs_dict and 'lr' in configs_dict:
+            if configs_dict['optimizer'].lower() == 'sgd':
+                optimizer_value = "SGD"
+            elif configs_dict['optimizer'].lower() == 'rmsprop':
+                optimizer_value = "RMSprop"
+            else:
+                optimizer_value = configs_dict['optimizer'].capitalize()
+            result_row_string += "optimizer" + "=" + "'" + str(optimizer_value) \
+                                 + "(" + "lr=" + str(configs_dict['lr']) + ")" + "',"
+        elif 'optimizer' in configs_dict:
+            # lr will be set to its default value during model training
+            result_row_string += "optimizer" + "=" + "'" + str(configs_dict['optimizer']) \
+                                 + "()" + "',"
+        elif 'lr' in configs_dict:
+            # default optimizer value in Keras is SGD (unless changed in a future release).
+            result_row_string += "optimizer" + "=" + "'" + "SGD" \
+                                 + "(" + "lr=" + str(configs_dict['lr']) + ")" + "',"
+
+        for c in configs_dict:
+            if c == 'optimizer' or c == 'lr':
+                continue
+            elif c == 'metrics':
+                if callable(configs_dict[c]):
+                    result_row_string += str(c) + "=" + "[" + str(configs_dict[c]) + "],"
+                else:
+                    result_row_string += str(c) + "=" + "['" + str(configs_dict[c]) + "'],"
+            else:
+                if type(configs_dict[c]) == str or type(configs_dict[c]) == np.string_:
+                    result_row_string += str(c) + "=" + "'" + str(configs_dict[c]) + "',"
+                else:
+                    # ints, floats, none type, booleans
+                    result_row_string += str(c) + "=" + str(configs_dict[c]) + ","
+
+        return result_row_string[:-1] # to exclude the last comma
+
+    def create_mst_table(self):
+        """Initialize the output mst table, if it doesn't exist (for incremental loading).
+        """
+
+        create_query = """
+                        CREATE TABLE {self.model_selection_table} (
+                            {mst_key} SERIAL,
+                            {model_id} INTEGER,
+                            {compile_params} VARCHAR,
+                            {fit_params} VARCHAR,
+                            unique ({model_id}, {compile_params}, {fit_params})
+                        );
+                       """.format(self=self,
+                                  mst_key=ModelSelectionSchema.MST_KEY,
+                                  model_id=ModelSelectionSchema.MODEL_ID,
+                                  compile_params=ModelSelectionSchema.COMPILE_PARAMS,
+                                  fit_params=ModelSelectionSchema.FIT_PARAMS)
+        with MinWarning('warning'):
+            plpy.execute(create_query)
+
+    def create_mst_summary_table(self):
+        """Initialize the output mst table.
+        """
+        create_query = """
+                        CREATE TABLE {self.model_selection_summary_table} (
+                            {model_arch_table} VARCHAR,
+                            {object_table} VARCHAR
+                        );
+                       """.format(self=self,
+                                  model_arch_table=ModelSelectionSchema.MODEL_ARCH_TABLE,
+                                  object_table=ModelSelectionSchema.OBJECT_TABLE)
+        with MinWarning('warning'):
+            plpy.execute(create_query)
+
+    def insert_into_mst_table(self):
+        """Insert every thing in self.msts into the mst table.
+        """
+        for mst in self.msts:
+            model_id = mst[ModelSelectionSchema.MODEL_ID]
+            compile_params = mst[ModelSelectionSchema.COMPILE_PARAMS]
+            fit_params = mst[ModelSelectionSchema.FIT_PARAMS]
+            insert_query = """
+                            INSERT INTO
+                                {self.model_selection_table}(
+                                    {model_id_col},
+                                    {compile_params_col},
+                                    {fit_params_col}
+                                )
+                            VALUES (
+                                {model_id},
+                                $${compile_params}$$,
+                                $${fit_params}$$
+                            )
+                           """.format(model_id_col=ModelSelectionSchema.MODEL_ID,
+                                      compile_params_col=ModelSelectionSchema.COMPILE_PARAMS,
+                                      fit_params_col=ModelSelectionSchema.FIT_PARAMS,
+                                      **locals())
+            plpy.execute(insert_query)
+        if self.object_table is None:
+            object_table = 'NULL::VARCHAR'
+        else:
+            object_table = '$${0}$$'.format(self.object_table)
+        insert_summary_query = """
+                        INSERT INTO
+                            {self.model_selection_summary_table}(
+                                {model_arch_table_name},
+                                {object_table_name}
+                        )
+                        VALUES (
+                            $${self.model_arch_table}$$,
+                            {object_table}
+                        )
+                       """.format(model_arch_table_name=ModelSelectionSchema.MODEL_ARCH_TABLE,
+                                  object_table_name=ModelSelectionSchema.OBJECT_TABLE,
+                                  **locals())
+        plpy.execute(insert_summary_query)

Review comment:
       New line

##########
File path: src/ports/postgres/modules/deep_learning/madlib_keras_model_selection.py_in
##########
@@ -18,11 +18,20 @@
 
 import plpy
 from collections import OrderedDict
+import numpy as np
+from itertools import product as itertools_product
+from ast import literal_eval
 from madlib_keras_validator import MstLoaderInputValidator
 from utilities.control import MinWarning
-from utilities.utilities import add_postfix
+from utilities.utilities import add_postfix, extract_keyvalue_params, _assert
+from utilities.validate_args import table_exists
 from madlib_keras_wrapper import convert_string_of_args_to_dict
 from keras_model_arch_table import ModelArchSchema
+from madlib_keras_wrapper import parse_and_validate_fit_params
+from madlib_keras_wrapper import parse_and_validate_compile_params
+import keras.losses as losses
+import keras.metrics as metrics
+from madlib_keras_custom_function import CustomFunctionSchema

Review comment:
       We should sort these imports in some way. Maybe the external imports (plpy, keras etc) first and then the madlib ones would work.

##########
File path: src/ports/postgres/modules/deep_learning/madlib_keras_model_selection.py_in
##########
@@ -203,3 +212,365 @@ class MstLoader():
                                   object_table_name=ModelSelectionSchema.OBJECT_TABLE,
                                   **locals())
         plpy.execute(insert_summary_query)
+
+@MinWarning("warning")
+class MstSearch():
+    """
+    The utility class for generating model selection configs and loading into a MST table with model parameters.
+
+    Currently takes string representations of python dictionaries for compile and fit params.
+    Generates configs with a chosen search algorithm
+
+    Attributes:
+        model_arch_table (str): The name of model architecture table.
+        model_selection_table (str): The name of the output mst table.
+        model_id_list (list): The input list of model id choices.
+        compile_params_grid (string repr of python dict): The input of compile params choices.
+        fit_params_grid (string repr of python dict): The input of fit params choices.
+        search_type (str, default 'grid'): Hyperparameter search strategy, 'grid' or 'random'.
+
+        Only for 'random' search type (defaults None):
+            num_configs (int): Number of configs to generate.
+            random_state (int): Seed for result reproducibility.
+
+        object_table (str, default None): The name of the object table, for custom (metric) functions.
+
+    """
+
+    def __init__(self,
+                 model_arch_table,
+                 model_selection_table,
+                 model_id_list,
+                 compile_params_grid,
+                 fit_params_grid,
+                 search_type='grid',
+                 num_configs=None,
+                 random_state=None,
+                 object_table=None,
+                 **kwargs):
+
+        self.model_arch_table = model_arch_table
+        self.model_selection_table = model_selection_table
+        self.model_selection_summary_table = add_postfix(
+            model_selection_table, "_summary")
+        self.model_id_list = sorted(list(set(model_id_list)))
+
+        MstLoaderInputValidator(
+            model_arch_table=self.model_arch_table,
+            model_selection_table=self.model_selection_table,
+            model_selection_summary_table=self.model_selection_summary_table,
+            model_id_list=self.model_id_list,
+            compile_params_list=compile_params_grid,
+            fit_params_list=fit_params_grid,
+            object_table=object_table,
+            module_name='generate_model_selection_configs'
+        )
+
+        self.search_type = search_type
+        self.num_configs = num_configs
+        self.random_state = random_state
+        self.object_table = object_table
+
+        compile_params_grid = compile_params_grid.replace('\n', '').replace(' ', '')
+        fit_params_grid = fit_params_grid.replace('\n', '').replace(' ', '')
+        self.validate_inputs(compile_params_grid, fit_params_grid)
+
+        # extracting python dict
+        self.compile_params_dict = literal_eval(compile_params_grid)
+        self.fit_params_dict = literal_eval(fit_params_grid)
+
+        self.msts = []
+
+        if self.search_type == 'grid':
+            self.find_grid_combinations()
+        elif self.search_type == 'random': # else should also suffice as random search is established.
+            self.find_random_combinations()
+
+        compile_params_lst, fit_params_lst = [], []
+        for i in self.msts:
+            compile_params_lst.append(i[ModelSelectionSchema.COMPILE_PARAMS])
+            fit_params_lst.append(i[ModelSelectionSchema.FIT_PARAMS])
+        self._validate_params_and_object_table(compile_params_lst, fit_params_lst)
+
+    def load(self):
+        """The entry point for loading the model selection table.
+        """
+        # All of the side effects happen in this function.
+        if not table_exists(self.model_selection_table):
+            self.create_mst_table()
+        self.create_mst_summary_table()
+        self.insert_into_mst_table()
+
+    def validate_inputs(self, compile_params_grid, fit_params_grid):
+        """
+        Ensures validity of inputs related to grid and random search.
+
+        :param compile_params_grid: The input string repr of compile params choices.
+        :param fit_params_grid: The input string repr of fit params choices.
+        """
+
+        # TODO: add additional cases for validating params (and test it)
+
+        if self.search_type == 'grid':
+            _assert(self.num_configs is None and self.random_state is None,
+                    "'num_configs' and 'random_state' have to be NULL for Grid Search")

Review comment:
       In general, MADlib error messages start with the module's name. Just prefixing these messages with `DL:` should be good enough.
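
   A minimal sketch of what that would look like. The `_assert` stand-in below is only there to make the snippet self-contained; the real helper lives in `utilities.utilities`, and the message text is taken from the diff:

   ```python
   # Stand-in for MADlib's _assert (utilities.utilities), included so this
   # snippet runs on its own.
   def _assert(condition, msg):
       if not condition:
           raise ValueError(msg)

   def validate_inputs(search_type, num_configs, random_state):
       # Error message prefixed with the module name, per the convention above.
       if search_type == 'grid':
           _assert(num_configs is None and random_state is None,
                   "DL: 'num_configs' and 'random_state' have to be NULL "
                   "for Grid Search")

   validate_inputs('grid', None, None)   # passes silently
   ```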

##########
File path: src/ports/postgres/modules/deep_learning/madlib_keras_model_selection.py_in
##########
@@ -203,3 +212,365 @@ class MstLoader():
                                   object_table_name=ModelSelectionSchema.OBJECT_TABLE,
                                   **locals())
         plpy.execute(insert_summary_query)
+
+@MinWarning("warning")
+class MstSearch():
+    """
+    The utility class for generating model selection configs and loading into a MST table with model parameters.
+
+    Currently takes string representations of python dictionaries for compile and fit params.
+    Generates configs with a chosen search algorithm
+
+    Attributes:
+        model_arch_table (str): The name of model architecture table.
+        model_selection_table (str): The name of the output mst table.
+        model_id_list (list): The input list of model id choices.
+        compile_params_grid (string repr of python dict): The input of compile params choices.
+        fit_params_grid (string repr of python dict): The input of fit params choices.
+        search_type (str, default 'grid'): Hyperparameter search strategy, 'grid' or 'random'.
+
+        Only for 'random' search type (defaults None):
+            num_configs (int): Number of configs to generate.
+            random_state (int): Seed for result reproducibility.
+
+        object_table (str, default None): The name of the object table, for custom (metric) functions.
+
+    """
+
+    def __init__(self,
+                 model_arch_table,
+                 model_selection_table,
+                 model_id_list,
+                 compile_params_grid,
+                 fit_params_grid,
+                 search_type='grid',
+                 num_configs=None,
+                 random_state=None,
+                 object_table=None,
+                 **kwargs):
+
+        self.model_arch_table = model_arch_table
+        self.model_selection_table = model_selection_table
+        self.model_selection_summary_table = add_postfix(
+            model_selection_table, "_summary")
+        self.model_id_list = sorted(list(set(model_id_list)))
+
+        MstLoaderInputValidator(
+            model_arch_table=self.model_arch_table,
+            model_selection_table=self.model_selection_table,
+            model_selection_summary_table=self.model_selection_summary_table,
+            model_id_list=self.model_id_list,
+            compile_params_list=compile_params_grid,
+            fit_params_list=fit_params_grid,
+            object_table=object_table,
+            module_name='generate_model_selection_configs'
+        )
+
+        self.search_type = search_type
+        self.num_configs = num_configs
+        self.random_state = random_state
+        self.object_table = object_table
+
+        compile_params_grid = compile_params_grid.replace('\n', '').replace(' ', '')
+        fit_params_grid = fit_params_grid.replace('\n', '').replace(' ', '')
+        self.validate_inputs(compile_params_grid, fit_params_grid)
+
+        # extracting python dict
+        self.compile_params_dict = literal_eval(compile_params_grid)
+        self.fit_params_dict = literal_eval(fit_params_grid)
+
+        self.msts = []
+
+        if self.search_type == 'grid':
+            self.find_grid_combinations()
+        elif self.search_type == 'random': # else should also suffice as random search is established.
+            self.find_random_combinations()
+
+        compile_params_lst, fit_params_lst = [], []
+        for i in self.msts:
+            compile_params_lst.append(i[ModelSelectionSchema.COMPILE_PARAMS])
+            fit_params_lst.append(i[ModelSelectionSchema.FIT_PARAMS])
+        self._validate_params_and_object_table(compile_params_lst, fit_params_lst)
+
+    def load(self):
+        """The entry point for loading the model selection table.
+        """
+        # All of the side effects happen in this function.
+        if not table_exists(self.model_selection_table):
+            self.create_mst_table()
+        self.create_mst_summary_table()
+        self.insert_into_mst_table()
+
+    def validate_inputs(self, compile_params_grid, fit_params_grid):
+        """
+        Ensures validity of inputs related to grid and random search.
+
+        :param compile_params_grid: The input string repr of compile params choices.
+        :param fit_params_grid: The input string repr of fit params choices.
+        """
+
+        # TODO: add additional cases for validating params (and test it)
+
+        if self.search_type == 'grid':
+            _assert(self.num_configs is None and self.random_state is None,
+                    "'num_configs' and 'random_state' have to be NULL for Grid Search")
+            for distribution_type in ['linear', 'log']:
+                _assert(distribution_type not in compile_params_grid and distribution_type not in fit_params_grid,
+                        "Cannot search from a distribution with Grid Search!")
+        elif self.search_type == 'random':
+            _assert(self.num_configs is not None, "'num_configs' cannot be NULL for Random Search")
+        else:
+            plpy.error("'search_type' has to be either 'grid' or 'random' !")
+
+    def _validate_params_and_object_table(self, compile_params_lst, fit_params_lst):
+        if not fit_params_lst:
+            plpy.error("fit_params_list cannot be NULL")
+        for fit_params in fit_params_lst:
+            try:
+                res = parse_and_validate_fit_params(fit_params)
+            except Exception as e:
+                plpy.error(
+                    """Fit param check failed for: {0} \n
+                    {1}
+                    """.format(fit_params, str(e)))
+        if not compile_params_lst:
+            plpy.error( "compile_params_list cannot be NULL")
+        custom_fn_name = []
+        ## Initialize builtin loss/metrics functions
+        builtin_losses = dir(losses)
+        builtin_metrics = dir(metrics)
+        # Default metrics, since it is not part of the builtin metrics list
+        builtin_metrics.append('accuracy')
+        if self.object_table is not None:
+            res = plpy.execute("SELECT {0} from {1}".format(CustomFunctionSchema.FN_NAME,
+                                                            self.object_table))
+            for r in res:
+                custom_fn_name.append(r[CustomFunctionSchema.FN_NAME])
+        for compile_params in compile_params_lst:
+            try:
+                _, _, res = parse_and_validate_compile_params(compile_params)
+                # Validating if loss/metrics function called in compile_params
+                # is either defined in object table or is a built_in keras
+                # loss/metrics function
+                error_suffix = "but input object table missing!"
+                if self.object_table is not None:
+                    error_suffix = "is not defined in object table '{0}'!".format(self.object_table)
+
+                _assert(res['loss'] in custom_fn_name or res['loss'] in builtin_losses,
+                        "custom function '{0}' used in compile params " \
+                        "{1}".format(res['loss'], error_suffix))
+                if 'metrics' in res:
+                    _assert((len(set(res['metrics']).intersection(custom_fn_name)) > 0
+                             or len(set(res['metrics']).intersection(builtin_metrics)) > 0),
+                            "custom function '{0}' used in compile params " \
+                            "{1}".format(res['metrics'], error_suffix))
+
+            except Exception as e:
+                plpy.error(
+                    """Compile param check failed for: {0} \n
+                    {1}
+                    """.format(compile_params, str(e)))
+
+    def find_grid_combinations(self):
+        """
+        Finds combinations using grid search.
+        """
+        combined_dict = dict(self.compile_params_dict, **self.fit_params_dict)
+        combined_dict[ModelSelectionSchema.MODEL_ID] = self.model_id_list
+        keys, values = zip(*combined_dict.items())
+        all_configs_params = [dict(zip(keys, v)) for v in itertools_product(*values)]
+
+        # to separate the compile and fit configs
+        for config in all_configs_params:
+            combination = {}
+            compile_configs, fit_configs = {}, {}
+            for k in config:
+                if k == ModelSelectionSchema.MODEL_ID:
+                    combination[ModelSelectionSchema.MODEL_ID] = config[k]
+                elif k in self.compile_params_dict:
+                    compile_configs[k] = config[k]
+                elif k in self.fit_params_dict:
+                    fit_configs[k] = config[k]
+                else:
+                    plpy.error("{0} is an unidentified key".format(k))
+
+            combination[ModelSelectionSchema.COMPILE_PARAMS] = self.generate_row_string(compile_configs)
+            combination[ModelSelectionSchema.FIT_PARAMS] = self.generate_row_string(fit_configs)
+            self.msts.append(combination)
+
+    def find_random_combinations(self):
+        """
+        Finds combinations using random search.
+        """
+        if self.random_state:

Review comment:
       This if clause can be combined into a single line: `seed_changes = 0 if self.random_state else None`.
   I don't think you even need the conditional, since `random_state` is checked again later in any case. If you initialize `seed_changes` to 0 unconditionally, it will simply never change when `self.random_state` is false.
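
   To illustrate, a pure-Python sketch of the pattern (the hunk cuts off before showing how `seed_changes` is used, so the per-draw reseeding loop below is an assumption; stdlib `random` stands in for the PR's numpy RNG):

   ```python
   import random

   def sample_configs(param_grid, num_configs, random_state=None):
       # The suggested one-liner: when random_state is falsy, seed_changes
       # stays 0 and is simply never touched again.
       seed_changes = 0 if random_state else None
       configs = []
       for _ in range(num_configs):
           if random_state:
               # Reseed per draw for reproducibility; bumping the counter
               # keeps successive draws distinct.
               random.seed(random_state + seed_changes)
               seed_changes += 1
           configs.append({k: random.choice(v) for k, v in param_grid.items()})
       return configs
   ```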




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org