Posted to commits@madlib.apache.org by fm...@apache.org on 2019/12/17 20:40:51 UTC

[madlib] branch master updated: misc user doc updates for 1dot17

This is an automated email from the ASF dual-hosted git repository.

fmcquillan pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/madlib.git


The following commit(s) were added to refs/heads/master by this push:
     new ec5614f  misc user doc updates for 1dot17
ec5614f is described below

commit ec5614fe34fc4e410ac226a60985051fc166ee8e
Author: Frank McQuillan <fm...@pivotal.io>
AuthorDate: Tue Dec 17 12:38:01 2019 -0800

    misc user doc updates for 1dot17
---
 doc/mainpage.dox.in                                |  6 +--
 .../deep_learning/input_data_preprocessor.sql_in   |  4 +-
 .../deep_learning/keras_model_arch_table.sql_in    |  9 ++--
 .../modules/deep_learning/madlib_keras.sql_in      | 57 +++++++++++++++-------
 .../madlib_keras_fit_multiple_model.sql_in         | 28 ++++++-----
 src/ports/postgres/modules/knn/knn.sql_in          |  4 ++
 6 files changed, 69 insertions(+), 39 deletions(-)

diff --git a/doc/mainpage.dox.in b/doc/mainpage.dox.in
index 0e7b426..82be4a5 100644
--- a/doc/mainpage.dox.in
+++ b/doc/mainpage.dox.in
@@ -292,9 +292,9 @@ Interface and implementation are subject to change.
         @defgroup grp_gpu_configuration GPU Configuration
         @defgroup grp_keras Keras
         @defgroup grp_keras_model_arch Load Models
-        @defgroup grp_model_selection Model Selection
-        @brief Train multiple deep learning models at the same time.
-        @details Train multiple deep learning models at the same time.
+        @defgroup grp_model_selection Model Selection for DL
+        @brief Train multiple deep learning models at the same time for model architecture search and hyperparameter selection.
+        @details Train multiple deep learning models at the same time for model architecture search and hyperparameter selection.
         @{
             @defgroup grp_automl AutoML
             @defgroup grp_keras_run_model_selection Run Model Selection
diff --git a/src/ports/postgres/modules/deep_learning/input_data_preprocessor.sql_in b/src/ports/postgres/modules/deep_learning/input_data_preprocessor.sql_in
index ddc356f..f243417 100644
--- a/src/ports/postgres/modules/deep_learning/input_data_preprocessor.sql_in
+++ b/src/ports/postgres/modules/deep_learning/input_data_preprocessor.sql_in
@@ -853,7 +853,9 @@ Geoffrey Hinton with Nitish Srivastava and Kevin Swersky, http://www.cs.toronto.
 @anchor related
 @par Related Topics
 
-minibatch_preprocessing.sql_in
+training_preprocessor_dl()
+
+validation_preprocessor_dl()
 
 gpu_configuration()
 
diff --git a/src/ports/postgres/modules/deep_learning/keras_model_arch_table.sql_in b/src/ports/postgres/modules/deep_learning/keras_model_arch_table.sql_in
index b1bf150..cc915bb 100644
--- a/src/ports/postgres/modules/deep_learning/keras_model_arch_table.sql_in
+++ b/src/ports/postgres/modules/deep_learning/keras_model_arch_table.sql_in
@@ -275,11 +275,10 @@ SELECT COUNT(*) FROM model_arch_library WHERE model_weights IS NOT NULL;
 -------+
      1
 </pre>
-Load weights from Keras using psycopg2.
-(Psycopg is a PostgreSQL database adapter for the
-Python programming language.) As above we need to
-flatten then serialize the weights to store as a
-PostgreSQL binary data type.
+Load weights from Keras using psycopg2. (Psycopg is a PostgreSQL database adapter for the
+Python programming language.) As above, we need to flatten and then serialize the weights to store
+them as a PostgreSQL binary data type. Note that the psycopg2.Binary function used below will increase
+the size of the Python object for the weights, so if your model is large it might be better to use a PL/Python function as above.
 <pre class="example">
 import psycopg2
 import psycopg2 as p2
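The flatten-then-serialize step described above can be sketched outside the database with NumPy (dummy arrays stand in for a Keras model's `get_weights()` output; the table and column names follow the doc's `model_arch_library` example):

```python
import numpy as np

# Dummy per-layer weight arrays standing in for model.get_weights().
weights = [np.arange(6, dtype=np.float32).reshape(2, 3),
           np.zeros(3, dtype=np.float32)]

# Flatten each layer's weights to 1-D, concatenate, then serialize to
# raw float32 bytes suitable for a PostgreSQL bytea column.
flattened = np.concatenate([w.flatten() for w in weights])
serialized = flattened.tobytes()

# With psycopg2, the bytes would be wrapped before the UPDATE, e.g.:
#   cur.execute(
#       "UPDATE model_arch_library SET model_weights = %s WHERE model_id = %s",
#       (psycopg2.Binary(serialized), 2))
```

The `psycopg2.Binary` wrapper copies the buffer, which is why the doc suggests a PL/Python function instead for very large models.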
diff --git a/src/ports/postgres/modules/deep_learning/madlib_keras.sql_in b/src/ports/postgres/modules/deep_learning/madlib_keras.sql_in
index 6127031..0a395e8 100644
--- a/src/ports/postgres/modules/deep_learning/madlib_keras.sql_in
+++ b/src/ports/postgres/modules/deep_learning/madlib_keras.sql_in
@@ -737,7 +737,12 @@ madlib_keras_predict_byom(
   <DT>class_values (optional)</DT>
   <DD>TEXT[], default: NULL.
     List of class labels that were used while training the model. See the 'output_table'
-    column for more details.
+    column above for more details.
+
+    @note
+    If you specify the class values parameter,
+    it must reflect how the dependent variable was 1-hot encoded for training. If you accidentally
+    pick another order that does not match the 1-hot encoding, the predictions will be wrong.
   </DD>
 
   <DT>normalizing_const (optional)</DT>
@@ -1166,7 +1171,7 @@ WHERE iris_predict.estimated_class_text != iris_test.class_text;
      6
 (1 row)
 </pre>
-Percent missclassifications:
+Accuracy:
 <pre class="example">
 SELECT round(count(*)*100/(150*0.2),2) as test_accuracy_percent from
     (select iris_test.class_text as actual, iris_predict.estimated_class_text as estimated
@@ -1188,10 +1193,18 @@ syntax. See <a href="group__grp__keras__model__arch.html">load_keras_model</a>
 for details on how to load the model architecture and weights.
 In this example we will use weights we already have:
 <pre class="example">
-UPDATE model_arch_library SET model_weights = model_weights FROM iris_model WHERE model_id = 1;
+UPDATE model_arch_library
+SET model_weights = iris_model.model_weights
+FROM iris_model
+WHERE model_arch_library.model_id = 1;
 </pre>
 Now train using a model from the model architecture table directly
-without referencing the model table from the MADlib training:
+without referencing the model table from the MADlib training.  Note that if you
+specify the class values parameter as we do below, it must reflect how the dependent
+variable was 1-hot encoded for training. In this example the 'training_preprocessor_dl()'
+in Step 2 above encoded in the order {'Iris-setosa', 'Iris-versicolor', 'Iris-virginica'} so
+this is the order we pass in the parameter. If we accidentally pick another order that does
+not match the 1-hot encoding, the predictions will be wrong.
 <pre class="example">
 DROP TABLE IF EXISTS iris_predict_byom;
 SELECT madlib.madlib_keras_predict_byom('model_arch_library',  -- model arch table
@@ -1254,7 +1267,7 @@ WHERE iris_predict_byom.estimated_dependent_var != iris_test.class_text;
      6
 (1 row)
 </pre>
-Percent missclassifications:
+Accuracy:
 <pre class="example">
 SELECT round(count(*)*100/(150*0.2),2) as test_accuracy_percent from
     (select iris_test.class_text as actual, iris_predict_byom.estimated_dependent_var as estimated
@@ -1495,7 +1508,10 @@ Fetch the weights from a previous MADlib run. (Normally
 these would be downloaded from a source that trained
 the same model architecture on a related dataset.)
 <pre class="example">
-UPDATE model_arch_library SET model_weights = model_weights FROM iris_model WHERE model_id = 2;
+UPDATE model_arch_library
+SET model_weights = iris_model.model_weights
+FROM iris_model
+WHERE model_arch_library.model_id = 2;
 </pre>
 Now train the model using the transfer model and the pre-trained weights:
 <pre class="example">
@@ -1556,23 +1572,26 @@ and versions.
 
 2.  Classification is currently supported, not regression.
 
-3.  On the effect of database cluster size: as the database cluster
-size increases, the per iteration loss will be higher since the
-model only sees 1/n of the data, where n is the number of segments.
-However, each iteration runs faster than single node because it is only
-traversing 1/n of the data.  For large data sets, all else being equal,
-a bigger cluster will achieve a given accuracy faster than a single node
-although it may take more iterations to achieve that accuracy.
-However, for highly non-convex solution spaces, convergence behavior
-may diminish as cluster size increases.  Ensure that each segment has
-sufficient volume of data and examples of each class value.
-
 @anchor background
 @par Technical Background
 
 For an introduction to deep learning foundations, including MLP and CNN,
 refer to [6].
 
+This module trains a single large model across the database cluster
+using the bulk synchronous parallel (BSP) approach, with model averaging [7].
+
+On the effect of database cluster size: as the database cluster size increases, the per-iteration
+loss will be higher since the model only sees 1/n of the data, where n is the number of segments.
+However, each iteration runs faster than on a single node because it only traverses 1/n of the data.
+For highly non-convex solution spaces, convergence behavior may diminish as cluster size increases.
+Ensure that each segment has sufficient volume of data and examples of each class value.
+
+Alternatively, to train multiple models at the same time for model
+architecture search or hyperparameter tuning, you can
+use <a href="group__grp__keras__run__model__selection.html">Model Selection</a>,
+which does not do model averaging and hence may have better convergence efficiency.
+
 @anchor literature
 @literature
 
@@ -1591,6 +1610,10 @@ http://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf
 
 [6] Deep Learning, Ian Goodfellow, Yoshua Bengio and Aaron Courville, MIT Press, 2016.
 
+[7] "Resource-Efficient and Reproducible Model Selection on Deep Learning Systems," Supun Nakandala,
+Yuhao Zhang, and Arun Kumar, Technical Report, Computer Science and Engineering, University of California,
+San Diego https://adalabucsd.github.io/papers/TR_2019_Cerebro.pdf.
+
 @anchor related
 @par Related Topics
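Why the class-values order matters can be sketched outside the database (hypothetical probability arrays, not the MADlib API): a model's softmax output is positional, so the label list only decodes it correctly in the original one-hot order.

```python
import numpy as np

# Softmax-style outputs: index i corresponds to the i-th class used
# when the dependent variable was one-hot encoded during training.
probs = np.array([[0.1, 0.8, 0.1],    # row 0: class index 1 most likely
                  [0.7, 0.2, 0.1]])   # row 1: class index 0 most likely

# Correct order: matches how training_preprocessor_dl() encoded labels.
correct = ['Iris-setosa', 'Iris-versicolor', 'Iris-virginica']
# Wrong order: same labels, different positions.
wrong = ['Iris-virginica', 'Iris-setosa', 'Iris-versicolor']

pred_correct = [correct[i] for i in probs.argmax(axis=1)]
pred_wrong = [wrong[i] for i in probs.argmax(axis=1)]
# Same model outputs, different decoded labels: only the first list
# is right, which is the failure mode the @note above warns about.
```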
 
diff --git a/src/ports/postgres/modules/deep_learning/madlib_keras_fit_multiple_model.sql_in b/src/ports/postgres/modules/deep_learning/madlib_keras_fit_multiple_model.sql_in
index 1ddbd18..c0a68b3 100644
--- a/src/ports/postgres/modules/deep_learning/madlib_keras_fit_multiple_model.sql_in
+++ b/src/ports/postgres/modules/deep_learning/madlib_keras_fit_multiple_model.sql_in
@@ -1314,24 +1314,26 @@ and versions.
 
 2.  Classification is currently supported, not regression.
 
-3.  On the effect of database cluster size: as the database cluster
-size increases, it will be proportionally faster to train a set of
-models, as long as you have at least as many model selection tuples
-as segments.  This is because model state is "hopped" from
-segment to segment and training takes place in parallel.  See [1,2]
-for details on how model hopping works.  If you have fewer model selection
-tuples to train than segments, then some segments may not be busy 100%
-of the time so speedup will not necessarily be linear with database
-cluster size.  Inference (predict) is an embarrassingly parallel
-operation so inference runtimes will be proportionally faster as the number
-of segments increases.
-
 @anchor background
 @par Technical Background
 
 For an introduction to deep learning foundations, including MLP and CNN,
 refer to [7].
 
+This module trains many models at a time across the database cluster in order
+to explore network architectures and hyperparameters.  It uses model hopper
+parallelism (MOP) and has high convergence efficiency since it does not do
+model averaging [2].
+
+On the effect of database cluster size: as the database cluster size increases,
+it will be proportionally faster to train a set of models, as long as you have at
+least as many model selection tuples as segments. This is because model state is "hopped" from
+segment to segment and training takes place in parallel [1,2]. If you have fewer model
+selection tuples to train than segments, then some
+segments may not be busy 100% of the time so speedup will not necessarily be linear with
+database cluster size. Inference (predict) is an embarrassingly parallel operation so
+inference runtimes will be proportionally faster as the number of segments increases.
+
 @anchor literature
 @literature
 
@@ -1340,7 +1342,7 @@ refer to [7].
 Supun Nakandala, Yuhao Zhang, and Arun Kumar, ACM SIGMOD 2019 DEEM Workshop,
 https://adalabucsd.github.io/papers/2019_Cerebro_DEEM.pdf
 
-[2] Resource-Efficient and Reproducible Model Selection on Deep Learning Systems,"
+[2] "Resource-Efficient and Reproducible Model Selection on Deep Learning Systems,"
 Supun Nakandala, Yuhao Zhang, and Arun Kumar, Technical Report, Computer Science and
 Engineering, University of California, San Diego
 https://adalabucsd.github.io/papers/TR_2019_Cerebro.pdf
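The scaling note above can be illustrated with simple arithmetic (an idealized sketch, not the actual model-hopper scheduler): with one model configuration training per segment per round, a set of configurations needs roughly ceil(configs / segments) sequential rounds, so speedup is linear only while every segment stays busy.

```python
import math

def training_rounds(num_configs, num_segments):
    """Idealized number of sequential rounds to train num_configs model
    selection tuples on num_segments segments, assuming one config per
    segment per round and ignoring hop overhead."""
    return math.ceil(num_configs / num_segments)

# 16 configs on 8 segments: every segment busy in both rounds -> 2 rounds.
# 10 configs on 8 segments: still 2 rounds, but 6 segments idle in
# round 2, so speedup is sub-linear, matching the note above.
```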
diff --git a/src/ports/postgres/modules/knn/knn.sql_in b/src/ports/postgres/modules/knn/knn.sql_in
index daeddc8..22822ed 100644
--- a/src/ports/postgres/modules/knn/knn.sql_in
+++ b/src/ports/postgres/modules/knn/knn.sql_in
@@ -121,6 +121,10 @@ in a column of type <tt>DOUBLE PRECISION[]</tt>.
 <dd>TEXT. Name of the column with testing data points
 or expression that evaluates to a numeric array</dd>
 
+@note
+For unsupervised nearest neighbors, make the test dataset the same as the source dataset,
+so the nearest neighbor of each point is the point itself, with a zero distance.
+
 <dt>test_id</dt>
 <dd>TEXT. Name of the column having ids of data points in test data table.</dd>
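The unsupervised usage in the @note above (test set identical to source set) can be sketched with a brute-force nearest-neighbor pass in plain NumPy, outside the knn() SQL interface:

```python
import numpy as np

# A small source set of 2-D points.
points = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]])

# Pairwise Euclidean distances between the source set and itself.
dists = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)

# When the test data equals the source data, each point's nearest
# neighbor is itself at distance zero (the matrix diagonal), exactly
# as the note describes.
nearest = dists.argmin(axis=1)
```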