Posted to commits@madlib.apache.org by fm...@apache.org on 2019/12/17 20:40:51 UTC
[madlib] branch master updated: misc user doc updates for 1dot17
This is an automated email from the ASF dual-hosted git repository.
fmcquillan pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/madlib.git
The following commit(s) were added to refs/heads/master by this push:
new ec5614f misc user doc updates for 1dot17
ec5614f is described below
commit ec5614fe34fc4e410ac226a60985051fc166ee8e
Author: Frank McQuillan <fm...@pivotal.io>
AuthorDate: Tue Dec 17 12:38:01 2019 -0800
misc user doc updates for 1dot17
---
doc/mainpage.dox.in | 6 +--
.../deep_learning/input_data_preprocessor.sql_in | 4 +-
.../deep_learning/keras_model_arch_table.sql_in | 9 ++--
.../modules/deep_learning/madlib_keras.sql_in | 57 +++++++++++++++-------
.../madlib_keras_fit_multiple_model.sql_in | 28 ++++++-----
src/ports/postgres/modules/knn/knn.sql_in | 4 ++
6 files changed, 69 insertions(+), 39 deletions(-)
diff --git a/doc/mainpage.dox.in b/doc/mainpage.dox.in
index 0e7b426..82be4a5 100644
--- a/doc/mainpage.dox.in
+++ b/doc/mainpage.dox.in
@@ -292,9 +292,9 @@ Interface and implementation are subject to change.
@defgroup grp_gpu_configuration GPU Configuration
@defgroup grp_keras Keras
@defgroup grp_keras_model_arch Load Models
- @defgroup grp_model_selection Model Selection
- @brief Train multiple deep learning models at the same time.
- @details Train multiple deep learning models at the same time.
+ @defgroup grp_model_selection Model Selection for DL
+ @brief Train multiple deep learning models at the same time for model architecture search and hyperparameter selection.
+ @details Train multiple deep learning models at the same time for model architecture search and hyperparameter selection.
@{
@defgroup grp_automl AutoML
@defgroup grp_keras_run_model_selection Run Model Selection
diff --git a/src/ports/postgres/modules/deep_learning/input_data_preprocessor.sql_in b/src/ports/postgres/modules/deep_learning/input_data_preprocessor.sql_in
index ddc356f..f243417 100644
--- a/src/ports/postgres/modules/deep_learning/input_data_preprocessor.sql_in
+++ b/src/ports/postgres/modules/deep_learning/input_data_preprocessor.sql_in
@@ -853,7 +853,9 @@ Geoffrey Hinton with Nitish Srivastava and Kevin Swersky, http://www.cs.toronto.
@anchor related
@par Related Topics
-minibatch_preprocessing.sql_in
+training_preprocessor_dl()
+
+validation_preprocessor_dl()
gpu_configuration()
diff --git a/src/ports/postgres/modules/deep_learning/keras_model_arch_table.sql_in b/src/ports/postgres/modules/deep_learning/keras_model_arch_table.sql_in
index b1bf150..cc915bb 100644
--- a/src/ports/postgres/modules/deep_learning/keras_model_arch_table.sql_in
+++ b/src/ports/postgres/modules/deep_learning/keras_model_arch_table.sql_in
@@ -275,11 +275,10 @@ SELECT COUNT(*) FROM model_arch_library WHERE model_weights IS NOT NULL;
-------+
1
</pre>
-Load weights from Keras using psycopg2.
-(Psycopg is a PostgreSQL database adapter for the
-Python programming language.) As above we need to
-flatten then serialize the weights to store as a
-PostgreSQL binary data type.
+Load weights from Keras using psycopg2. (Psycopg is a PostgreSQL database adapter for the
+Python programming language.) As above, we need to flatten and then serialize the weights to store as a
+PostgreSQL binary data type. Note that the psycopg2.Binary function used below will increase the size of the
+Python object for the weights, so if your model is large it might be better to use a PL/Python function as above.
<pre class="example">
import psycopg2
import psycopg2 as p2
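As a plain-Python illustration of the flatten-and-serialize step described above (the nested weight values here are hypothetical, and the standard-library `array` module stands in for numpy; with psycopg2 you would wrap the resulting bytes in `psycopg2.Binary` before inserting):

```python
from array import array

# Hypothetical nested layer weights, shaped like what Keras' get_weights()
# returns (real weights would be numpy arrays; plain lists keep this sketch
# self-contained).
layer_weights = [[0.1, -0.2, 0.3], [0.5], [1.0, 2.0]]

# Flatten all layers into a single list of floats.
flat = [w for layer in layer_weights for w in layer]

# Serialize as float32 bytes, the form stored in the bytea column;
# with psycopg2 this byte string would then be passed via psycopg2.Binary(...).
serialized = array('f', flat).tobytes()

print(len(flat), len(serialized))  # 6 floats -> 24 bytes
```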
diff --git a/src/ports/postgres/modules/deep_learning/madlib_keras.sql_in b/src/ports/postgres/modules/deep_learning/madlib_keras.sql_in
index 6127031..0a395e8 100644
--- a/src/ports/postgres/modules/deep_learning/madlib_keras.sql_in
+++ b/src/ports/postgres/modules/deep_learning/madlib_keras.sql_in
@@ -737,7 +737,12 @@ madlib_keras_predict_byom(
<DT>class_values (optional)</DT>
<DD>TEXT[], default: NULL.
List of class labels that were used while training the model. See the 'output_table'
- column for more details.
+ column above for more details.
+
+ @note
+ If you specify the class values parameter,
it must reflect how the dependent variable was 1-hot encoded for training. If you accidentally
pick an order that does not match the 1-hot encoding, the predictions will be wrong.
</DD>
<DT>normalizing_const (optional)</DT>
@@ -1166,7 +1171,7 @@ WHERE iris_predict.estimated_class_text != iris_test.class_text;
6
(1 row)
</pre>
-Percent missclassifications:
+Accuracy:
<pre class="example">
SELECT round(count(*)*100/(150*0.2),2) as test_accuracy_percent from
(select iris_test.class_text as actual, iris_predict.estimated_class_text as estimated
@@ -1188,10 +1193,18 @@ syntax. See <a href="group__grp__keras__model__arch.html">load_keras_model</a>
for details on how to load the model architecture and weights.
In this example we will use weights we already have:
<pre class="example">
-UPDATE model_arch_library SET model_weights = model_weights FROM iris_model WHERE model_id = 1;
+UPDATE model_arch_library
+SET model_weights = iris_model.model_weights
+FROM iris_model
+WHERE model_arch_library.model_id = 1;
</pre>
Now train using a model from the model architecture table directly
-without referencing the model table from the MADlib training:
+without referencing the model table from the MADlib training. Note that if you
+specify the class values parameter as we do below, it must reflect how the dependent
+variable was 1-hot encoded for training. In this example the 'training_preprocessor_dl()'
+in Step 2 above encoded in the order {'Iris-setosa', 'Iris-versicolor', 'Iris-virginica'} so
this is the order we pass in the parameter. If we accidentally pick another order that does
not match the 1-hot encoding, the predictions will be wrong.
<pre class="example">
DROP TABLE IF EXISTS iris_predict_byom;
SELECT madlib.madlib_keras_predict_byom('model_arch_library', -- model arch table
@@ -1254,7 +1267,7 @@ WHERE iris_predict_byom.estimated_dependent_var != iris_test.class_text;
6
(1 row)
</pre>
-Percent missclassifications:
+Accuracy:
<pre class="example">
SELECT round(count(*)*100/(150*0.2),2) as test_accuracy_percent from
(select iris_test.class_text as actual, iris_predict_byom.estimated_dependent_var as estimated
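The arithmetic behind the accuracy query above can be checked in plain Python (the numbers mirror the example: a 20% hold-out of the 150-row iris table gives 30 test rows, of which 6 were misclassified in the preceding count):

```python
total_test_rows = int(150 * 0.2)   # 20% hold-out from the 150-row iris table
misclassified = 6                  # mismatch count from the preceding query
correct = total_test_rows - misclassified

# Same formula as the SQL: round(count(*)*100/(150*0.2), 2)
accuracy_percent = round(correct * 100 / total_test_rows, 2)
print(accuracy_percent)  # 80.0
```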
@@ -1495,7 +1508,10 @@ Fetch the weights from a previous MADlib run. (Normally
these would be downloaded from a source that trained
the same model architecture on a related dataset.)
<pre class="example">
-UPDATE model_arch_library SET model_weights = model_weights FROM iris_model WHERE model_id = 2;
+UPDATE model_arch_library
+SET model_weights = iris_model.model_weights
+FROM iris_model
+WHERE model_arch_library.model_id = 2;
</pre>
Now train the model using the transfer model and the pre-trained weights:
<pre class="example">
@@ -1556,23 +1572,26 @@ and versions.
2. Classification is currently supported, not regression.
-3. On the effect of database cluster size: as the database cluster
-size increases, the per iteration loss will be higher since the
-model only sees 1/n of the data, where n is the number of segments.
-However, each iteration runs faster than single node because it is only
-traversing 1/n of the data. For large data sets, all else being equal,
-a bigger cluster will achieve a given accuracy faster than a single node
-although it may take more iterations to achieve that accuracy.
-However, for highly non-convex solution spaces, convergence behavior
-may diminish as cluster size increases. Ensure that each segment has
-sufficient volume of data and examples of each class value.
-
@anchor background
@par Technical Background
For an introduction to deep learning foundations, including MLP and CNN,
refer to [6].
+This module trains a single large model across the database cluster
+using the bulk synchronous parallel (BSP) approach, with model averaging [7].
+
+On the effect of database cluster size: as the database cluster size increases, the per iteration
+loss will be higher since the model only sees 1/n of the data, where n is the number of segments.
+However, each iteration runs faster than on a single node because it traverses only 1/n of the data.
+For highly non-convex solution spaces, convergence behavior may diminish as cluster size increases.
+Ensure that each segment has sufficient volume of data and examples of each class value.
+
+Alternatively, to train multiple models at the same time for model
+architecture search or hyperparameter tuning, you can
+use <a href="group__grp__keras__run__model__selection.html">Model Selection</a>,
+which does not do model averaging and hence may have better convergence efficiency.
+
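The per-iteration averaging step described above can be sketched in a few lines of plain Python (purely illustrative, with toy weight vectors; the real implementation averages Keras weight arrays collected from the segments):

```python
def average_weights(per_segment_weights):
    """BSP-style model averaging: each segment trains on its 1/n slice
    of the data, then the coordinator averages the resulting weight
    vectors element-wise to form the next iteration's model."""
    n = len(per_segment_weights)
    return [sum(ws) / n for ws in zip(*per_segment_weights)]

# Three segments, each producing a (toy) 4-element weight vector.
merged = average_weights([[1.0, 2.0, 3.0, 4.0],
                          [3.0, 2.0, 1.0, 0.0],
                          [2.0, 2.0, 2.0, 2.0]])
print(merged)  # [2.0, 2.0, 2.0, 2.0]
```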
@anchor literature
@literature
@@ -1591,6 +1610,10 @@ http://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf
[6] Deep Learning, Ian Goodfellow, Yoshua Bengio and Aaron Courville, MIT Press, 2016.
+[7] "Resource-Efficient and Reproducible Model Selection on Deep Learning Systems," Supun Nakandala,
+Yuhao Zhang, and Arun Kumar, Technical Report, Computer Science and Engineering, University of California,
+San Diego https://adalabucsd.github.io/papers/TR_2019_Cerebro.pdf.
+
@anchor related
@par Related Topics
diff --git a/src/ports/postgres/modules/deep_learning/madlib_keras_fit_multiple_model.sql_in b/src/ports/postgres/modules/deep_learning/madlib_keras_fit_multiple_model.sql_in
index 1ddbd18..c0a68b3 100644
--- a/src/ports/postgres/modules/deep_learning/madlib_keras_fit_multiple_model.sql_in
+++ b/src/ports/postgres/modules/deep_learning/madlib_keras_fit_multiple_model.sql_in
@@ -1314,24 +1314,26 @@ and versions.
2. Classification is currently supported, not regression.
-3. On the effect of database cluster size: as the database cluster
-size increases, it will be proportionally faster to train a set of
-models, as long as you have at least as many model selection tuples
-as segments. This is because model state is "hopped" from
-segment to segment and training takes place in parallel. See [1,2]
-for details on how model hopping works. If you have fewer model selection
-tuples to train than segments, then some segments may not be busy 100%
-of the time so speedup will not necessarily be linear with database
-cluster size. Inference (predict) is an embarrassingly parallel
-operation so inference runtimes will be proportionally faster as the number
-of segments increases.
-
@anchor background
@par Technical Background
For an introduction to deep learning foundations, including MLP and CNN,
refer to [7].
+This module trains many models at a time across the database cluster in order
+to explore network architectures and hyperparameters. It uses model hopper
+parallelism (MOP) and has high convergence efficiency since it does not do
+model averaging [2].
+
+On the effect of database cluster size: as the database cluster size increases,
+it will be proportionally faster to train a set of models, as long as you have at
+least as many model selection tuples as segments. This is because model state is "hopped" from
+segment to segment and training takes place in parallel [1,2]. If you have fewer model
+selection tuples to train than segments, then some
+segments may not be busy 100% of the time so speedup will not necessarily be linear with
+database cluster size. Inference (predict) is an embarrassingly parallel operation so
+inference runtimes will be proportionally faster as the number of segments increases.
+
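The hopping pattern described above can be sketched as a simple round-robin schedule (illustrative only; see [1,2] for how the actual scheduler works):

```python
def hop_schedule(num_models, num_segments, num_rounds):
    """Round-robin model hopper: at round r, model m trains on segment
    (m + r) % num_segments, so every model visits every segment once
    per num_segments rounds and no two models collide on a segment."""
    return [[(m + r) % num_segments for m in range(num_models)]
            for r in range(num_rounds)]

# 3 models on 3 segments: each round places the models on a
# distinct permutation of the segments, keeping all segments busy.
for r, placement in enumerate(hop_schedule(3, 3, 3)):
    print(r, placement)
```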
@anchor literature
@literature
@@ -1340,7 +1342,7 @@ refer to [7].
Supun Nakandala, Yuhao Zhang, and Arun Kumar, ACM SIGMOD 2019 DEEM Workshop,
https://adalabucsd.github.io/papers/2019_Cerebro_DEEM.pdf
-[2] Resource-Efficient and Reproducible Model Selection on Deep Learning Systems,"
+[2] "Resource-Efficient and Reproducible Model Selection on Deep Learning Systems,"
Supun Nakandala, Yuhao Zhang, and Arun Kumar, Technical Report, Computer Science and
Engineering, University of California, San Diego
https://adalabucsd.github.io/papers/TR_2019_Cerebro.pdf
diff --git a/src/ports/postgres/modules/knn/knn.sql_in b/src/ports/postgres/modules/knn/knn.sql_in
index daeddc8..22822ed 100644
--- a/src/ports/postgres/modules/knn/knn.sql_in
+++ b/src/ports/postgres/modules/knn/knn.sql_in
@@ -121,6 +121,10 @@ in a column of type <tt>DOUBLE PRECISION[]</tt>.
<dd>TEXT. Name of the column with testing data points
or expression that evaluates to a numeric array</dd>
+@note
+For unsupervised nearest neighbors, make the test dataset the same as the source dataset,
+so the nearest neighbor of each point is the point itself, with a zero distance.
+
<dt>test_id</dt>
<dd>TEXT. Name of the column having ids of data points in test data table.</dd>
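The unsupervised usage in the note above can be illustrated with a brute-force nearest-neighbor search on toy data (the real kNN computation runs in-database; this sketch only shows why each point's nearest neighbor is itself when the test set equals the source set):

```python
def nearest(points, q):
    """Index of the point in 'points' closest to q (squared Euclidean)."""
    return min(range(len(points)),
               key=lambda i: sum((a - b) ** 2 for a, b in zip(points[i], q)))

points = [(0.0, 0.0), (1.0, 1.0), (5.0, 2.0)]
# When the test set equals the source set, each point's nearest
# neighbor is itself, at distance zero.
print([nearest(points, p) for p in points])  # [0, 1, 2]
```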