Posted to commits@madlib.apache.org by fm...@apache.org on 2019/04/20 00:24:23 UTC

[madlib] branch master updated: add sections to RF and DT user docs on run-time and memory usage

This is an automated email from the ASF dual-hosted git repository.

fmcquillan pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/madlib.git


The following commit(s) were added to refs/heads/master by this push:
     new 20c87fa  add sections to RF and DT user docs on run-time and memory usage
20c87fa is described below

commit 20c87faefd3a166c5456112fba1c8b6ab107ad18
Author: Frank McQuillan <fm...@pivotal.io>
AuthorDate: Fri Apr 19 17:23:51 2019 -0700

    add sections to RF and DT user docs on run-time and memory usage
---
 .../deep_learning/keras_model_arch_table.sql_in    |  2 +-
 .../recursive_partitioning/decision_tree.sql_in    | 34 +++++++++++++----
 .../recursive_partitioning/random_forest.sql_in    | 43 +++++++++++++++++-----
 .../modules/regress/clustered_variance.sql_in      |  6 +--
 .../postgres/modules/sample/balance_sample.sql_in  |  2 +-
 src/ports/postgres/modules/svm/svm.sql_in          |  4 +-
 6 files changed, 67 insertions(+), 24 deletions(-)

diff --git a/src/ports/postgres/modules/deep_learning/keras_model_arch_table.sql_in b/src/ports/postgres/modules/deep_learning/keras_model_arch_table.sql_in
index bb734ab..16037c2 100644
--- a/src/ports/postgres/modules/deep_learning/keras_model_arch_table.sql_in
+++ b/src/ports/postgres/modules/deep_learning/keras_model_arch_table.sql_in
@@ -129,7 +129,7 @@ model.add(Dense(3, name='dense_2'))
 model.to_json()
 </pre>
 This is represented by the following JSON:
-<pre class="example">
+<pre class="result">
 '{"class_name": "Sequential", "keras_version": "2.1.6",
 "config": [{"class_name": "Dense", "config": {"kernel_initializer":
 {"class_name": "VarianceScaling", "config": {"distribution": "uniform",
diff --git a/src/ports/postgres/modules/recursive_partitioning/decision_tree.sql_in b/src/ports/postgres/modules/recursive_partitioning/decision_tree.sql_in
index 8ad7a9d..bf1c883 100644
--- a/src/ports/postgres/modules/recursive_partitioning/decision_tree.sql_in
+++ b/src/ports/postgres/modules/recursive_partitioning/decision_tree.sql_in
@@ -17,6 +17,7 @@ m4_include(`SQLCommon.m4')
 
 <div class="toc"><b>Contents</b><ul>
 <li class="level1"><a href="#train">Training Function</a></li>
+<li class="level1"><a href="#runtime">Run-time and Memory Usage</a></li>
 <li class="level1"><a href="#predict">Prediction Function</a></li>
 <li class="level1"><a href="#display">Tree Display</a></li>
 <li class="level1"><a href="#display_importance">Importance Display</a></li>
@@ -109,7 +110,7 @@ tree_train(
   by their value.
   </DD>
 
-  <DT>list_of_features_to_exclude</DT>
+  <DT>list_of_features_to_exclude (optional)</DT>
   <DD>TEXT. Comma-separated string of column names to exclude from the predictors
       list. If the <em>dependent_variable</em> is an expression (including cast of a column name),
       then this list should include the columns present in the
@@ -118,7 +119,7 @@ tree_train(
       The names in this parameter should be identical to the names used in the table and
       quoted appropriately. </DD>
 
-  <DT>split_criterion</DT>
+  <DT>split_criterion (optional)</DT>
   <DD>TEXT, default = 'gini' for classification, 'mse' for regression.
  Impurity function used to select the feature on which to split a node.
   Supported criteria are 'gini', 'entropy', 'misclassification' for
@@ -148,7 +149,8 @@ tree_train(
   <DD>INTEGER, default: 7. Maximum depth of any node of the final tree,
       with the root node counted as depth 0. A deeper tree can
       lead to better prediction but will also result in
-      longer processing time and higher memory usage.</DD>
+      longer processing time and higher memory usage.
+      Current allowed maximum is 100.</DD>
 
   <DT>min_split (optional)</DT>
   <DD>INTEGER, default: 20. Minimum number of observations that must exist
@@ -475,11 +477,27 @@ provided <em>cp</em> and explore all possible sub-trees (up to a single-node tre
 to compute the optimal sub-tree. The optimal sub-tree and the 'cp' corresponding
 to this optimal sub-tree are placed in the <em>output_table</em>, with the
 columns named <em>tree</em> and <em>pruning_cp</em> respectively.
-- The main parameters that affect memory usage are: depth of
-tree (‘max_depth’), number of features, number of values per
-categorical feature, and number of bins for continuous features (‘num_splits’).
-If you are hitting memory limits, consider reducing one or
-more of these parameters.
+
+@anchor runtime
+@par Run-time and Memory Usage
+
+The number of features and the number of class values per categorical feature have a direct
+impact on run-time and memory.  In addition, here is a summary of the main parameters
+in the training function that affect run-time and memory:
+
+| Parameter | Run-time | Memory | Notes |
+| :------ | :------ | :------ | :------ |
+| 'max_depth' | High | High | Deeper trees can take longer to run and use more memory. |
+| 'min_split' | No or little effect, unless very small. | No or little effect, unless very small. | If set too small, can increase run-time by building trees that are very bushy. |
+| 'min_bucket' | No or little effect, unless very small. | No or little effect, unless very small. | If set too small, can increase run-time by building trees that are very bushy. |
+| 'num_splits' | High | High | Depends on the number of continuous features.  Larger values effectively add more features as the binning becomes more granular. |
+
+If you experience long run-times or hit memory limits, consider reducing one or
+more of these parameters. One approach when building a decision tree model is to start
+with a low maximum depth and use the suggested defaults for the
+other parameters. This gives you a sense of run-time and test set accuracy.
+You can then increase maximum depth in a systematic way as required
+to improve accuracy.
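+
+As a rough illustration of this approach, a first call might look like the sketch
+below. The table name 'dt_input', the feature columns, the response column 'class'
+and the parameter values are placeholders for your own data, not part of this module:
+<pre class="example">
+DROP TABLE IF EXISTS dt_output, dt_output_summary;
+SELECT madlib.tree_train('dt_input',                          -- source table (placeholder name)
+                         'dt_output',                         -- output model table
+                         'id',                                -- id column
+                         'class',                             -- response (placeholder name)
+                         'feature_1, feature_2, feature_3',   -- features (placeholder names)
+                         NULL,                                -- no columns to exclude
+                         'gini',                              -- split criterion
+                         NULL,                                -- no grouping
+                         NULL,                                -- no weights
+                         3,                                   -- max_depth: start shallow
+                         20,                                  -- min_split
+                         6,                                   -- min_bucket
+                         10                                   -- num_splits per continuous feature
+                         );
+</pre>
+Check run-time and test set accuracy for this shallow tree first, then increase
+'max_depth' step by step as needed.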
 
 @anchor predict
 @par Prediction Function
diff --git a/src/ports/postgres/modules/recursive_partitioning/random_forest.sql_in b/src/ports/postgres/modules/recursive_partitioning/random_forest.sql_in
index ba0049b..251dfbc 100644
--- a/src/ports/postgres/modules/recursive_partitioning/random_forest.sql_in
+++ b/src/ports/postgres/modules/recursive_partitioning/random_forest.sql_in
@@ -17,6 +17,7 @@ m4_include(`SQLCommon.m4')
 
 <div class="toc"><b>Contents</b><ul>
 <li class="level1"><a href="#train">Training Function</a></li>
+<li class="level1"><a href="#runtime">Run-time and Memory Usage</a></li>
 <li class="level1"><a href="#predict">Prediction Function</a></li>
 <li class="level1"><a href="#get_tree">Tree Display</a></li>
 <li class="level1"><a href="#get_importance">Importance Display</a></li>
@@ -139,7 +140,8 @@ forest_train(training_table_name,
 
   <DT>num_random_features (optional)</DT>
   <DD>INTEGER, default: sqrt(n) for classification, n/3
-  for regression. This is the number of features to randomly
+  for regression, where n is the number of features.
+  This parameter is the number of features to randomly
   select at each split.</DD>
 
   <DT>importance (optional)</DT>
@@ -154,7 +156,8 @@ forest_train(training_table_name,
 
   <DT>num_permutations (optional)</DT>
   <DD>INTEGER, default: 1. Number of times to permute each feature value while
-      calculating the out-of-bag variable importance.
+      calculating the out-of-bag variable importance.  Only applies when
+      the 'importance' parameter is set to true.
 
     @note Variable importance for a feature is determined by permuting the variable
     and computing the drop in predictive accuracy using out-of-bag samples [1].
@@ -174,7 +177,10 @@ forest_train(training_table_name,
   <DD>INTEGER, default: 7. Maximum depth of any node of a tree,
       with the root node counted as depth 0. A deeper tree can
       lead to better prediction but will also result in
-      longer processing time and higher memory usage.</DD>
+      longer processing time and higher memory usage.
+      Current allowed maximum is 15.  Note that since random forest
+      is an ensemble method, individual trees typically do not need
+      to be deep.</DD>
 
   <DT>min_split (optional)</DT>
   <DD>INTEGER, default: 20. Minimum number of observations that must exist
@@ -477,11 +483,30 @@ forest_train(training_table_name,
   </DD>
 </DL>
 
-@note The main parameters that affect memory usage are: depth of
-tree (‘max_tree_depth’), number of features, number of values per
-categorical feature, and number of bins for continuous features (‘num_splits’).
-If you are hitting memory limits, consider reducing one or
-more of these parameters.
+@anchor runtime
+@par Run-time and Memory Usage
+
+The number of features and the number of class values per categorical feature have a direct
+impact on run-time and memory.  In addition, here is a summary of the main parameters
+in the training function that affect run-time and memory:
+
+| Parameter | Run-time | Memory | Notes |
+| :------ | :------ | :------ | :------ |
+| 'num_trees' | High | No or little effect. | Linear in the number of trees.  Note that trees are trained sequentially, one after another, although each individual tree is trained in parallel. |
+| 'importance' | Moderate | No or little effect. | Depends on the number of features and the 'num_permutations' parameter. |
+| 'num_permutations' | Moderate | No or little effect. | Depends on the number of features. |
+| 'max_tree_depth' | High | High | Deeper trees can take longer to run and use more memory. |
+| 'min_split' | No or little effect, unless very small. | No or little effect, unless very small. | If set too small, can increase run-time by building trees that are very bushy. |
+| 'min_bucket' | No or little effect, unless very small. | No or little effect, unless very small. | If set too small, can increase run-time by building trees that are very bushy. |
+| 'num_splits' | High | High | Depends on the number of continuous features.  Larger values effectively add more features as the binning becomes more granular. |
+| 'sample_ratio' | High | High | A smaller sample ratio reduces run-time by training each tree on only a fraction of the data. |
+
+If you experience long run-times or hit memory limits, consider reducing one or
+more of these parameters. One approach when building a random forest model is to start
+with a small number of trees and a low maximum depth, and use the suggested defaults for the
+other parameters. This gives you a sense of run-time and test set accuracy.
+You can then increase the number of trees and maximum depth in a systematic way as required
+to improve accuracy.
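+
+As a rough illustration of this approach, a first call might look like the sketch
+below. The table name 'rf_input', the feature columns, the response column 'class'
+and the parameter values are placeholders for your own data, not part of this module:
+<pre class="example">
+DROP TABLE IF EXISTS rf_output, rf_output_group, rf_output_summary;
+SELECT madlib.forest_train('rf_input',                          -- source table (placeholder name)
+                           'rf_output',                         -- output model table
+                           'id',                                -- id column
+                           'class',                             -- response (placeholder name)
+                           'feature_1, feature_2, feature_3',   -- features (placeholder names)
+                           NULL,                                -- no columns to exclude
+                           NULL,                                -- no grouping
+                           10,                                  -- num_trees: start with a small forest
+                           2,                                   -- num_random_features
+                           TRUE,                                -- importance
+                           1,                                   -- num_permutations
+                           3,                                   -- max_tree_depth: start shallow
+                           20,                                  -- min_split
+                           6,                                   -- min_bucket
+                           10                                   -- num_splits per continuous feature
+                           );
+</pre>
+Check run-time and test set accuracy for this small, shallow forest first, then
+increase 'num_trees' and 'max_tree_depth' step by step as needed.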
 
 @anchor predict
 @par Prediction Function
@@ -1446,7 +1471,7 @@ File random_forest.sql_in documenting the training function
   *        are to be used as predictors (except the ones included in
   *        the next argument). Boolean, integer, and text columns are
   *        considered categorical columns.
-  * @param list_of_features_to_exclude OPTIONAL. List of column names
+  * @param list_of_features_to_exclude List of column names
  *        (comma-separated string) to exclude from the predictors list.
   * @param grouping_cols OPTIONAL. List of column names (comma-separated
   *        string) to group the data by. This will lead to creating
diff --git a/src/ports/postgres/modules/regress/clustered_variance.sql_in b/src/ports/postgres/modules/regress/clustered_variance.sql_in
index afd83d0..f05630d 100644
--- a/src/ports/postgres/modules/regress/clustered_variance.sql_in
+++ b/src/ports/postgres/modules/regress/clustered_variance.sql_in
@@ -291,7 +291,7 @@ SELECT madlib.clustered_variance_linregr();
 
 -# Run the linear regression function and view the results.
 <pre class="example">
-DROP TABLE IF EXISTS out_table;
+DROP TABLE IF EXISTS out_table, out_table_summary;
 SELECT madlib.clustered_variance_linregr( 'abalone',
                                           'out_table',
                                           'rings',
@@ -309,7 +309,7 @@ SELECT madlib.clustered_variance_logregr();
 
 -# Run the logistic regression function and view the results.
 <pre class="example">
-DROP TABLE IF EXISTS out_table;
+DROP TABLE IF EXISTS out_table, out_table_summary;
 SELECT madlib.clustered_variance_logregr( 'abalone',
                                           'out_table',
                                           'rings < 10',
@@ -326,7 +326,7 @@ SELECT madlib.clustered_variance_mlogregr();
 
 -# Run the multinomial logistic regression and view the results.
 <pre class="example">
-DROP TABLE IF EXISTS out_table;
+DROP TABLE IF EXISTS out_table, out_table_summary;
 SELECT madlib.clustered_variance_mlogregr( 'abalone',
                                            'out_table',
                                            'CASE WHEN rings < 10 THEN 1 ELSE 0 END',
diff --git a/src/ports/postgres/modules/sample/balance_sample.sql_in b/src/ports/postgres/modules/sample/balance_sample.sql_in
index eea73aa..15d86e6 100644
--- a/src/ports/postgres/modules/sample/balance_sample.sql_in
+++ b/src/ports/postgres/modules/sample/balance_sample.sql_in
@@ -185,7 +185,7 @@ The following table shows how the parameters 'class_size'
 and 'output_table_size' work together:
 
 | Case | 'class_size' | 'output_table_size' | Result |
-| :---- | :---- | :---- | :---- |
+| :------ | :------ | :----------------- | :-------- |
 | 1 | 'uniform' | NULL | Resample for uniform class size with output size = input size (i.e., balanced). |
 | 2 | 'uniform' | 10000 | Resample for uniform class size with output size = 10K (i.e., balanced). |
 | 3 | NULL | NULL | Resample for uniform class size with output size = input size (i.e., balanced). Class_size=NULL has the same behavior as ‘uniform’. |
diff --git a/src/ports/postgres/modules/svm/svm.sql_in b/src/ports/postgres/modules/svm/svm.sql_in
index ddfa134..a55fd5f 100644
--- a/src/ports/postgres/modules/svm/svm.sql_in
+++ b/src/ports/postgres/modules/svm/svm.sql_in
@@ -414,8 +414,8 @@ resulting \e init_stepsize can be run on the whole dataset.
 
 <DT>tolerance</dt>
 <DD>Default: 1e-10. The criterion to end iterations. The training stops whenever
-<the difference between the training models of two consecutive iterations is
-<smaller than \e tolerance or the iteration number is larger than \e max_iter.
+the difference between the training models of two consecutive iterations is
+smaller than \e tolerance or the iteration number is larger than \e max_iter.
 </DD>
 
 <DT>lambda</dt>