Posted to commits@madlib.apache.org by fm...@apache.org on 2019/04/20 00:24:23 UTC
[madlib] branch master updated: add sections to RF and DT user docs
on run-time and memory usage
This is an automated email from the ASF dual-hosted git repository.
fmcquillan pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/madlib.git
The following commit(s) were added to refs/heads/master by this push:
new 20c87fa add sections to RF and DT user docs on run-time and memory usage
20c87fa is described below
commit 20c87faefd3a166c5456112fba1c8b6ab107ad18
Author: Frank McQuillan <fm...@pivotal.io>
AuthorDate: Fri Apr 19 17:23:51 2019 -0700
add sections to RF and DT user docs on run-time and memory usage
---
.../deep_learning/keras_model_arch_table.sql_in | 2 +-
.../recursive_partitioning/decision_tree.sql_in | 34 +++++++++++++----
.../recursive_partitioning/random_forest.sql_in | 43 +++++++++++++++++-----
.../modules/regress/clustered_variance.sql_in | 6 +--
.../postgres/modules/sample/balance_sample.sql_in | 2 +-
src/ports/postgres/modules/svm/svm.sql_in | 4 +-
6 files changed, 67 insertions(+), 24 deletions(-)
diff --git a/src/ports/postgres/modules/deep_learning/keras_model_arch_table.sql_in b/src/ports/postgres/modules/deep_learning/keras_model_arch_table.sql_in
index bb734ab..16037c2 100644
--- a/src/ports/postgres/modules/deep_learning/keras_model_arch_table.sql_in
+++ b/src/ports/postgres/modules/deep_learning/keras_model_arch_table.sql_in
@@ -129,7 +129,7 @@ model.add(Dense(3, name='dense_2'))
model.to_json()
</pre>
This is represented by the following JSON:
-<pre class="example">
+<pre class="result">
'{"class_name": "Sequential", "keras_version": "2.1.6",
"config": [{"class_name": "Dense", "config": {"kernel_initializer":
{"class_name": "VarianceScaling", "config": {"distribution": "uniform",
diff --git a/src/ports/postgres/modules/recursive_partitioning/decision_tree.sql_in b/src/ports/postgres/modules/recursive_partitioning/decision_tree.sql_in
index 8ad7a9d..bf1c883 100644
--- a/src/ports/postgres/modules/recursive_partitioning/decision_tree.sql_in
+++ b/src/ports/postgres/modules/recursive_partitioning/decision_tree.sql_in
@@ -17,6 +17,7 @@ m4_include(`SQLCommon.m4')
<div class="toc"><b>Contents</b><ul>
<li class="level1"><a href="#train">Training Function</a></li>
+<li class="level1"><a href="#runtime">Run-time and Memory Usage</a></li>
<li class="level1"><a href="#predict">Prediction Function</a></li>
<li class="level1"><a href="#display">Tree Display</a></li>
<li class="level1"><a href="#display_importance">Importance Display</a></li>
@@ -109,7 +110,7 @@ tree_train(
by their value.
</DD>
- <DT>list_of_features_to_exclude</DT>
+ <DT>list_of_features_to_exclude (optional)</DT>
<DD>TEXT. Comma-separated string of column names to exclude from the predictors
list. If the <em>dependent_variable</em> is an expression (including cast of a column name),
then this list should include the columns present in the
@@ -118,7 +119,7 @@ tree_train(
The names in this parameter should be identical to the names used in the table and
quoted appropriately. </DD>
- <DT>split_criterion</DT>
+ <DT>split_criterion (optional)</DT>
<DD>TEXT, default = 'gini' for classification, 'mse' for regression.
Impurity function to compute the feature to use to split a node.
Supported criteria are 'gini', 'entropy', 'misclassification' for
@@ -148,7 +149,8 @@ tree_train(
<DD>INTEGER, default: 7. Maximum depth of any node of the final tree,
with the root node counted as depth 0. A deeper tree can
lead to better prediction but will also result in
- longer processing time and higher memory usage.</DD>
+ longer processing time and higher memory usage.
+ Current allowed maximum is 100.</DD>
<DT>min_split (optional)</DT>
<DD>INTEGER, default: 20. Minimum number of observations that must exist
@@ -475,11 +477,27 @@ provided <em>cp</em> and explore all possible sub-trees (up to a single-node tre
to compute the optimal sub-tree. The optimal sub-tree and the 'cp' corresponding
to this optimal sub-tree is placed in the <em>output_table</em>, with the
columns named as <em>tree</em> and <em>pruning_cp</em> respectively.
-- The main parameters that affect memory usage are: depth of
-tree (‘max_depth’), number of features, number of values per
-categorical feature, and number of bins for continuous features (‘num_splits’).
-If you are hitting memory limits, consider reducing one or
-more of these parameters.
+
+@anchor runtime
+@par Run-time and Memory Usage
+
+The number of features and the number of class values per categorical feature have a direct
+impact on run-time and memory. In addition, here is a summary of the main parameters
+in the training function that affect run-time and memory:
+
+| Parameter | Run-time | Memory | Notes |
+| :------ | :------ | :------ | :------ |
+| 'max_depth' | High | High | Deeper trees can take longer to run and use more memory. |
+| 'min_split' | No or little effect, unless very small. | No or little effect, unless very small. | If set too small, run-time can increase because the resulting trees are very bushy. |
+| 'min_bucket' | No or little effect, unless very small. | No or little effect, unless very small. | If set too small, run-time can increase because the resulting trees are very bushy. |
+| 'num_splits' | High | High | Depends on number of continuous variables. Effectively adds more features as the binning becomes more granular. |
+
+If you experience long run-times or are hitting memory limits, consider reducing one or
+more of these parameters. One approach when building a decision tree model is to start
+with a low maximum depth value and use suggested defaults for
+other parameters. This will give you a sense of run-time and test set accuracy.
+Then you can change maximum depth in a systematic way as required
+to improve accuracy.
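To make the suggested approach concrete, here is a hedged sketch of a tree_train call that starts with a shallow tree and default values for the other parameters. The table and column names ('train_table', 'id', 'response') are hypothetical placeholders, not part of this commit:

```sql
DROP TABLE IF EXISTS train_output, train_output_summary;
SELECT madlib.tree_train(
    'train_table',     -- hypothetical source table
    'train_output',    -- output model table
    'id',              -- id column
    'response',        -- dependent variable
    '*',               -- use all other columns as features
    NULL,              -- no features excluded
    'gini',            -- split criterion (default for classification)
    NULL,              -- no grouping columns
    NULL,              -- no weights
    3                  -- max_depth: start shallow
);
```

If run-time is acceptable, max_depth can then be raised in steps (e.g., 3, 5, 7) while monitoring test set accuracy.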
@anchor predict
@par Prediction Function
diff --git a/src/ports/postgres/modules/recursive_partitioning/random_forest.sql_in b/src/ports/postgres/modules/recursive_partitioning/random_forest.sql_in
index ba0049b..251dfbc 100644
--- a/src/ports/postgres/modules/recursive_partitioning/random_forest.sql_in
+++ b/src/ports/postgres/modules/recursive_partitioning/random_forest.sql_in
@@ -17,6 +17,7 @@ m4_include(`SQLCommon.m4')
<div class="toc"><b>Contents</b><ul>
<li class="level1"><a href="#train">Training Function</a></li>
+<li class="level1"><a href="#runtime">Run-time and Memory Usage</a></li>
<li class="level1"><a href="#predict">Prediction Function</a></li>
<li class="level1"><a href="#get_tree">Tree Display</a></li>
<li class="level1"><a href="#get_importance">Importance Display</a></li>
@@ -139,7 +140,8 @@ forest_train(training_table_name,
<DT>num_random_features (optional)</DT>
<DD>INTEGER, default: sqrt(n) for classification, n/3
- for regression. This is the number of features to randomly
+ for regression, where n is the number of features.
+ This parameter is the number of features to randomly
select at each split.</DD>
<DT>importance (optional)</DT>
@@ -154,7 +156,8 @@ forest_train(training_table_name,
<DT>num_permutations (optional)</DT>
<DD>INTEGER, default: 1. Number of times to permute each feature value while
- calculating the out-of-bag variable importance.
+ calculating the out-of-bag variable importance. Only applies when
+ the 'importance' parameter is set to true.
@note Variable importance for a feature is determined by permuting the variable
and computing the drop in predictive accuracy using out-of-bag samples [1].
@@ -174,7 +177,10 @@ forest_train(training_table_name,
<DD>INTEGER, default: 7. Maximum depth of any node of a tree,
with the root node counted as depth 0. A deeper tree can
lead to better prediction but will also result in
- longer processing time and higher memory usage.</DD>
+ longer processing time and higher memory usage.
+ Current allowed maximum is 15. Note that since random forest
+ is an ensemble method, individual trees typically do not need
+ to be deep.</DD>
<DT>min_split (optional)</DT>
<DD>INTEGER, default: 20. Minimum number of observations that must exist
@@ -477,11 +483,30 @@ forest_train(training_table_name,
</DD>
</DL>
-@note The main parameters that affect memory usage are: depth of
-tree (‘max_tree_depth’), number of features, number of values per
-categorical feature, and number of bins for continuous features (‘num_splits’).
-If you are hitting memory limits, consider reducing one or
-more of these parameters.
+@anchor runtime
+@par Run-time and Memory Usage
+
+The number of features and the number of class values per categorical feature have a direct
+impact on run-time and memory. In addition, here is a summary of the main parameters
+in the training function that affect run-time and memory:
+
+| Parameter | Run-time | Memory | Notes |
+| :------ | :------ | :------ | :------ |
+| 'num_trees' | High | No or little effect. | Run-time is linear in the number of trees. Note that trees train sequentially one after another, though each individual tree is trained in parallel. |
+| 'importance' | Moderate | No or little effect. | Depends on number of features and 'num_permutations' parameter. |
+| 'num_permutations' | Moderate | No or little effect. | Depends on number of features. |
+| 'max_tree_depth' | High | High | Deeper trees can take longer to run and use more memory. |
+| 'min_split' | No or little effect, unless very small. | No or little effect, unless very small. | If set too small, run-time can increase because the resulting trees are very bushy. |
+| 'min_bucket' | No or little effect, unless very small. | No or little effect, unless very small. | If set too small, run-time can increase because the resulting trees are very bushy. |
+| 'num_splits' | High | High | Depends on number of continuous variables. Effectively adds more features as the binning becomes more granular. |
+| 'sample_ratio' | High | High | A smaller ratio reduces run-time and memory by training each tree on only a subset of the data. |
+
+If you experience long run-times or are hitting memory limits, consider reducing one or
+more of these parameters. One approach when building a random forest model is to start
+with a small number of trees and a low maximum depth value, and use suggested defaults for
+other parameters. This will give you a sense of run-time and test set accuracy.
+Then you can change number of trees and maximum depth in a systematic way as required
+to improve accuracy.
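To make the suggested approach concrete, here is a hedged sketch of a forest_train call that starts with a small ensemble of shallow trees. The table and column names ('train_table', 'id', 'response') are hypothetical placeholders, not part of this commit:

```sql
DROP TABLE IF EXISTS rf_output, rf_output_group, rf_output_summary;
SELECT madlib.forest_train(
    'train_table',   -- hypothetical source table
    'rf_output',     -- output model table
    'id',            -- id column
    'response',      -- dependent variable
    '*',             -- use all other columns as features
    NULL,            -- no features excluded
    NULL,            -- no grouping columns
    10,              -- num_trees: start small
    NULL,            -- num_random_features: use default
    true,            -- importance
    1,               -- num_permutations
    5                -- max_tree_depth: shallow to start (allowed maximum is 15)
);
```

Number of trees and maximum depth can then be increased in steps while monitoring run-time and test set accuracy.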
@anchor predict
@par Prediction Function
@@ -1446,7 +1471,7 @@ File random_forest.sql_in documenting the training function
* are to be used as predictors (except the ones included in
* the next argument). Boolean, integer, and text columns are
* considered categorical columns.
- * @param list_of_features_to_exclude OPTIONAL. List of column names
+ * @param list_of_features_to_exclude List of column names
* (comma-separated string) to exclude from the predictors list.
* @param grouping_cols OPTIONAL. List of column names (comma-separated
* string) to group the data by. This will lead to creating
diff --git a/src/ports/postgres/modules/regress/clustered_variance.sql_in b/src/ports/postgres/modules/regress/clustered_variance.sql_in
index afd83d0..f05630d 100644
--- a/src/ports/postgres/modules/regress/clustered_variance.sql_in
+++ b/src/ports/postgres/modules/regress/clustered_variance.sql_in
@@ -291,7 +291,7 @@ SELECT madlib.clustered_variance_linregr();
-# Run the linear regression function and view the results.
<pre class="example">
-DROP TABLE IF EXISTS out_table;
+DROP TABLE IF EXISTS out_table, out_table_summary;
SELECT madlib.clustered_variance_linregr( 'abalone',
'out_table',
'rings',
@@ -309,7 +309,7 @@ SELECT madlib.clustered_variance_logregr();
-# Run the logistic regression function and view the results.
<pre class="example">
-DROP TABLE IF EXISTS out_table;
+DROP TABLE IF EXISTS out_table, out_table_summary;
SELECT madlib.clustered_variance_logregr( 'abalone',
'out_table',
'rings < 10',
@@ -326,7 +326,7 @@ SELECT madlib.clustered_variance_mlogregr();
-# Run the multinomial logistic regression and view the results.
<pre class="example">
-DROP TABLE IF EXISTS out_table;
+DROP TABLE IF EXISTS out_table, out_table_summary;
SELECT madlib.clustered_variance_mlogregr( 'abalone',
'out_table',
'CASE WHEN rings < 10 THEN 1 ELSE 0 END',
diff --git a/src/ports/postgres/modules/sample/balance_sample.sql_in b/src/ports/postgres/modules/sample/balance_sample.sql_in
index eea73aa..15d86e6 100644
--- a/src/ports/postgres/modules/sample/balance_sample.sql_in
+++ b/src/ports/postgres/modules/sample/balance_sample.sql_in
@@ -185,7 +185,7 @@ The following table shows how the parameters 'class_size'
and 'output_table_size' work together:
| Case | 'class_size' | 'output_table_size' | Result |
-| :---- | :---- | :---- | :---- |
+| :------ | :------ | :----------------- | :-------- |
| 1 | 'uniform' | NULL | Resample for uniform class size with output size = input size (i.e., balanced). |
| 2 | 'uniform' | 10000 | Resample for uniform class size with output size = 10K (i.e., balanced). |
| 3 | NULL | NULL | Resample for uniform class size with output size = input size (i.e., balanced). Class_size=NULL has the same behavior as 'uniform'. |
diff --git a/src/ports/postgres/modules/svm/svm.sql_in b/src/ports/postgres/modules/svm/svm.sql_in
index ddfa134..a55fd5f 100644
--- a/src/ports/postgres/modules/svm/svm.sql_in
+++ b/src/ports/postgres/modules/svm/svm.sql_in
@@ -414,8 +414,8 @@ resulting \e init_stepsize can be run on the whole dataset.
<DT>tolerance</dt>
<DD>Default: 1e-10. The criterion to end iterations. The training stops whenever
-<the difference between the training models of two consecutive iterations is
-<smaller than \e tolerance or the iteration number is larger than \e max_iter.
+the difference between the training models of two consecutive iterations is
+smaller than \e tolerance or the iteration number is larger than \e max_iter.
</DD>
<DT>lambda</dt>