You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@madlib.apache.org by GitBox <gi...@apache.org> on 2021/01/21 19:10:36 UTC

[GitHub] [madlib] orhankislal opened a new pull request #532: DL: Add multiple variable support

orhankislal opened a new pull request #532:
URL: https://github.com/apache/madlib/pull/532


   This commit adds support for multiple dependent and independent variables.
   These changes should allow the users to use YOLO v3 and v4 models.
   Note that the existing interface with single dependent and single independent
   variables still works as expected. The output table formats have changed
   slightly to accomodate this new feature. The summary tables store various
   fields such as num_classes or dep_vartype as arrays, even if they have single
   entries.
   
   The implementation reads up to 5 dependent and 5 independent variables
   separately thanks to the new interface. If more variables are passed, they are
   packed into bytea arrays. A high number of large variables might cause us to go
   over the 1 GB limit of Postgres due to packing.
   
   <!--  
   
   Thanks for sending a pull request!  Here are some tips for you:
   1. Refer to this link for contribution guidelines https://cwiki.apache.org/confluence/display/MADLIB/Contribution+Guidelines
   2. Please Provide the Module Name, a JIRA Number and a short description about your changes.
   -->
   
   - [ ] Add the module name, JIRA# to PR/commit and description.
   - [ ] Add tests for the change. 
   
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [madlib] reductionista commented on a change in pull request #532: DL: Add multiple variable support

Posted by GitBox <gi...@apache.org>.
reductionista commented on a change in pull request #532:
URL: https://github.com/apache/madlib/pull/532#discussion_r562896522



##########
File path: src/ports/postgres/modules/deep_learning/input_data_preprocessor.sql_in
##########
@@ -940,66 +910,13 @@ $$ LANGUAGE plpythonu VOLATILE
 m4_ifdef(`__HAS_FUNCTION_PROPERTIES__', `MODIFIES SQL DATA', `');
 
 CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.training_preprocessor_dl(
-    source_table            VARCHAR,
-    output_table            VARCHAR,
-    dependent_varname       VARCHAR,
-    independent_varname     VARCHAR,
-    buffer_size             INTEGER,
-    normalizing_const       REAL,
-    num_classes             INTEGER
-) RETURNS VOID AS $$
-  SELECT MADLIB_SCHEMA.training_preprocessor_dl($1, $2, $3, $4, $5, $6, $7, NULL);
-$$ LANGUAGE sql VOLATILE
-m4_ifdef(`__HAS_FUNCTION_PROPERTIES__', `MODIFIES SQL DATA', `');
-
-CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.training_preprocessor_dl(
-    source_table            VARCHAR,
-    output_table            VARCHAR,
-    dependent_varname       VARCHAR,
-    independent_varname     VARCHAR,
-    buffer_size             INTEGER,
-    normalizing_const       REAL
-) RETURNS VOID AS $$
-  SELECT MADLIB_SCHEMA.training_preprocessor_dl($1, $2, $3, $4, $5, $6, NULL, NULL);
-$$ LANGUAGE sql VOLATILE
-m4_ifdef(`__HAS_FUNCTION_PROPERTIES__', `MODIFIES SQL DATA', `');
-
-CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.training_preprocessor_dl(
-    source_table            VARCHAR,
-    output_table            VARCHAR,
-    dependent_varname       VARCHAR,
-    independent_varname     VARCHAR,
-    buffer_size             INTEGER
-) RETURNS VOID AS $$
-  SELECT MADLIB_SCHEMA.training_preprocessor_dl($1, $2, $3, $4, $5, 1.0, NULL, NULL);
-$$ LANGUAGE sql VOLATILE
-m4_ifdef(`__HAS_FUNCTION_PROPERTIES__', `MODIFIES SQL DATA', `');
-
-CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.training_preprocessor_dl(
-    source_table            VARCHAR,
-    output_table            VARCHAR,
-    dependent_varname       VARCHAR,
-    independent_varname     VARCHAR
-) RETURNS VOID AS $$
-  SELECT MADLIB_SCHEMA.training_preprocessor_dl($1, $2, $3, $4, NULL, 1.0, NULL, NULL);
-$$ LANGUAGE sql VOLATILE
-m4_ifdef(`__HAS_FUNCTION_PROPERTIES__', `MODIFIES SQL DATA', `');
-
-CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.training_preprocessor_dl(
-    message VARCHAR

Review comment:
       Thanks for cleaning up all these extra function definitions, so much easier to read!

##########
File path: src/ports/postgres/modules/deep_learning/madlib_keras_predict.py_in
##########
@@ -86,18 +87,34 @@ class BasePredict():
         if self.pred_type == 1:
             rank_create_sql = ""
 
-        self.pred_vartype = self.dependent_vartype.strip('[]')
-        unnest_sql = ''
-        if self.pred_vartype in ['text', 'character varying', 'varchar']:
+        self.pred_vartype = [i.strip('[]') for i in self.dependent_vartype]
+        unnest_sql = []
+        full_class_name_list = []
+        full_class_value_list = []
+
+        for i in range(self.dependent_var_count):
+
+            if self.pred_vartype[i] in ['text', 'character varying', 'varchar']:
+
+                unnest_sql.append("unnest(ARRAY{0}) AS {1}".format(
+                    ['NULL' if j is None else j for j in class_values[i]],
+                    self.dependent_varname[i]))

Review comment:
       I think this should be:
      quote_ident(self.dependent_varname[i])
   instead of:
      self.dependent_varname[i]

##########
File path: src/ports/postgres/modules/deep_learning/madlib_keras.sql_in
##########
@@ -1811,6 +1697,48 @@ PythonFunctionBodyOnlyNoSchema(`deep_learning', `madlib_keras')
 $$ LANGUAGE plpythonu
 m4_ifdef(`__HAS_FUNCTION_PROPERTIES__', `NO SQL', `');
 
+
+CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.fit_transition_wide(
+    state                       BYTEA,
+    dependent_var1              BYTEA,
+    dependent_var2              BYTEA,
+    dependent_var3              BYTEA,
+    dependent_var4              BYTEA,
+    dependent_var5              BYTEA,
+    independent_var1            BYTEA,
+    independent_var2            BYTEA,
+    independent_var3            BYTEA,
+    independent_var4            BYTEA,
+    independent_var5            BYTEA,

Review comment:
       Have you tried using the VARIADIC keyword here to get past this 5 param limit?  I recently discovered it works in both gpdb5 and gpdb6.  Example:
   ```
   CREATE FUNCTION fit_transition_wide(dependent_var VARIADIC BYTEA[]) RETURNS INTEGER[] AS $$ SELECT ARRAY[LENGTH($1[1]), LENGTH($1[2]), LENGTH($1[3])]; $$ LANGUAGE SQL;
   
   SELECT fit_transition_wide('abcd'::BYTEA, 'def'::BYTEA, 'hijkml'::BYTEA);
    fit_transition_wide
   ---------------------
    {4,3,6}
   (1 row)
   
   or
   
   select fit_transition_wide(VARIADIC ARRAY['abcd'::BYTEA, 'def'::BYTEA, 'hijkml'::BYTEA]);
    fit_transition_wide
   ---------------------
    {4,3,6}
   ```
   It basically works the same as *args in python.  It has to be the last parameter of the function, after all the others--the only catch is, there can only be one of them for each function.  But since they're both the same type, each with an equal number of params, it seems like we might be able to pass then both through the same VARIADIC param and have python split it in half after receiving it.
   
   Not sure if that might re-introduce the 1GB limit though.  If passing them as 10 separate params gives us a 10GB limit but passing them all through VARIADIC limits us to only 1GB then it wouldn't be worth it.  I'd assume most people won't care about having more than 5 anyway, but not sure.

##########
File path: src/ports/postgres/modules/deep_learning/madlib_keras_predict.py_in
##########
@@ -116,63 +133,82 @@ class BasePredict():
 
         # Calling CREATE TABLE instead of CTAS, to ensure that the plan_cache_mode
         # guc codepath is called when passing in the weights
-        plpy.execute("""
+        sql = """
             CREATE TABLE {self.output_table}
                 ({self.id_col} {self.id_col_type},
-                 {self.dependent_varname} {self.pred_vartype},
+                 class_name TEXT,
+                 class_value TEXT,
                  {pred_col_name} {pred_col_type},
                  rank INTEGER)
-            """.format(**locals()))
+            """.format(**locals())
+        plpy.execute(sql)
+
+        independent_varname_sql = ["{0}::REAL[]".format(i) for i in self.independent_varname]

Review comment:
       quote_ident(self.independent_varname)




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [madlib] orhankislal merged pull request #532: DL: Add multiple variable support

Posted by GitBox <gi...@apache.org>.
orhankislal merged pull request #532:
URL: https://github.com/apache/madlib/pull/532


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [madlib] reductionista commented on pull request #532: DL: Add multiple variable support

Posted by GitBox <gi...@apache.org>.
reductionista commented on pull request #532:
URL: https://github.com/apache/madlib/pull/532#issuecomment-767014388


   LGTM


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [madlib] reductionista commented on pull request #532: DL: Add multiple variable support

Posted by GitBox <gi...@apache.org>.
reductionista commented on pull request #532:
URL: https://github.com/apache/madlib/pull/532#issuecomment-767014388


   LGTM


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org