Posted to commits@madlib.apache.org by nj...@apache.org on 2018/02/23 20:28:25 UTC

madlib git commit: Docs: update KNN, DT and RF docs to match recent commits

Repository: madlib
Updated Branches:
  refs/heads/master fa6d53a42 -> b6f4fa1f5


Docs: update KNN, DT and RF docs to match recent commits

Closes #235


Project: http://git-wip-us.apache.org/repos/asf/madlib/repo
Commit: http://git-wip-us.apache.org/repos/asf/madlib/commit/b6f4fa1f
Tree: http://git-wip-us.apache.org/repos/asf/madlib/tree/b6f4fa1f
Diff: http://git-wip-us.apache.org/repos/asf/madlib/diff/b6f4fa1f

Branch: refs/heads/master
Commit: b6f4fa1f508e0c51f0e86c114c294f9448f55d99
Parents: fa6d53a
Author: Frank McQuillan <fm...@pivotal.io>
Authored: Tue Feb 13 16:16:57 2018 -0800
Committer: Nandish Jayaram <nj...@apache.org>
Committed: Fri Feb 23 12:23:29 2018 -0800

----------------------------------------------------------------------
 src/ports/postgres/modules/knn/knn.sql_in       | 19 ++++++++++---
 .../recursive_partitioning/decision_tree.sql_in | 13 +++++++++
 .../recursive_partitioning/random_forest.sql_in | 28 +++++++++++++++-----
 3 files changed, 50 insertions(+), 10 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/madlib/blob/b6f4fa1f/src/ports/postgres/modules/knn/knn.sql_in
----------------------------------------------------------------------
diff --git a/src/ports/postgres/modules/knn/knn.sql_in b/src/ports/postgres/modules/knn/knn.sql_in
index 3139c15..1a90652 100644
--- a/src/ports/postgres/modules/knn/knn.sql_in
+++ b/src/ports/postgres/modules/knn/knn.sql_in
@@ -147,9 +147,18 @@ The following distance functions can be used:
 <li><b>user defined function</b> with signature <tt>DOUBLE PRECISION[] x, DOUBLE PRECISION[] y -> DOUBLE PRECISION</tt></li></ul></dd>
 
 <dt>weighted_avg (optional)</dt>
-<dd>BOOLEAN, default: FALSE. Calculates the Regression or classication 
-of k-NN using the weighted average method.
-
+<dd>BOOLEAN, default: FALSE. Calculates classification or
+regression values using a weighted average.   The idea is to 
+weigh the contribution of each of the k neighbors according 
+to their distance to the test point, giving greater influence
+to closer neighbors.  The distance function 'fn_dist' specified
+above is used.
+
+For classification, majority voting weighs a neighbor
+according to inverse distance.
+
+For regression, the inverse distance weighting approach is
+used from Shepard [4].
 </dl>
 
 
@@ -392,6 +401,10 @@ is assigned to the test point.
 [3] Gongde Guo1, Hui Wang, David Bell, Yaxin Bi, Kieran Greer: KNN Model-Based Approach in Classification,
     https://ai2-s2-pdfs.s3.amazonaws.com/a7e2/814ec5db800d2f8c4313fd436e9cf8273821.pdf
 
+@anchor knn-lit-4
+[4]    Shepard, Donald (1968). "A two-dimensional interpolation function for 
+irregularly-spaced data". Proceedings of the 1968 ACM National Conference. pp. 517–524. 
+
 @internal
 @sa namespace knn (documenting the implementation in Python)
 @endinternal
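
Note on the weighted_avg text added above: the idea of inverse distance weighting can be illustrated with a small sketch. This is not MADlib's implementation. It is a minimal, standalone Python illustration, and the function names (k_nearest, idw_regression, idw_classification), the Euclidean distance, the weight 1/d (power 1), and the handling of zero distances are all assumptions made for the example.

```python
import math
from collections import defaultdict


def euclidean(x, y):
    """Euclidean distance; stands in for the 'fn_dist' argument (assumption)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def k_nearest(train, query, k, dist=euclidean):
    """Return the k (distance, label) pairs closest to query.

    train is a list of (point, label) pairs. Hypothetical helper, not a MADlib API.
    """
    scored = sorted(((dist(p, query), label) for p, label in train),
                    key=lambda pair: pair[0])
    return scored[:k]


def idw_regression(neighbors):
    """Inverse-distance-weighted average of the neighbors' values.

    Assumption: if any neighbor is at distance 0, return the mean of those
    exact matches, since 1/d is undefined there.
    """
    exact = [value for d, value in neighbors if d == 0]
    if exact:
        return sum(exact) / len(exact)
    weights = [1.0 / d for d, _ in neighbors]
    weighted_sum = sum(w * value for w, (_, value) in zip(weights, neighbors))
    return weighted_sum / sum(weights)


def idw_classification(neighbors):
    """Weighted majority vote: each neighbor votes with weight 1/d.

    Assumption: exact matches (distance 0) decide the vote among themselves.
    """
    exact = [(1.0, label) for d, label in neighbors if d == 0]
    voters = exact if exact else neighbors
    votes = defaultdict(float)
    for d, label in voters:
        votes[label] += 1.0 / d
    return max(votes, key=votes.get)


if __name__ == "__main__":
    train = [((0.0,), 1.0), ((1.0,), 2.0), ((3.0,), 4.0)]
    print(idw_regression(k_nearest(train, (2.0,), k=2)))
    print(idw_classification([(1.0, "a"), (4.0, "b"), (4.0, "b")]))
```

In the classification call, an unweighted majority vote would pick "b" (two votes to one), while the inverse distance weights favor the single closer neighbor "a" (1.0 versus 0.25 + 0.25).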

http://git-wip-us.apache.org/repos/asf/madlib/blob/b6f4fa1f/src/ports/postgres/modules/recursive_partitioning/decision_tree.sql_in
----------------------------------------------------------------------
diff --git a/src/ports/postgres/modules/recursive_partitioning/decision_tree.sql_in b/src/ports/postgres/modules/recursive_partitioning/decision_tree.sql_in
index 0878b10..eb0e760 100644
--- a/src/ports/postgres/modules/recursive_partitioning/decision_tree.sql_in
+++ b/src/ports/postgres/modules/recursive_partitioning/decision_tree.sql_in
@@ -355,6 +355,19 @@ tree_train(
     <th>independent_var_types</th>
     <td>TEXT. A comma separated string for the types of independent variables.</td>
     </tr>
+
+    <tr>
+    <th>n_folds</th>
+    <td>BIGINT. Number of cross-validation folds used.</td>
+    </tr>
+
+    <tr>
+    <th>null_proxy</th>
+    <td>TEXT. Describes how NULLs are handled.  If NULL is not 
+    treated as a separate categorical value, this will be NULL.
+    If NULL is treated as a separate categorical value, this will be 
+    set to "__NULL__"</td>
+    </tr>
    </table>
   </DD>
 </DL>

http://git-wip-us.apache.org/repos/asf/madlib/blob/b6f4fa1f/src/ports/postgres/modules/recursive_partitioning/random_forest.sql_in
----------------------------------------------------------------------
diff --git a/src/ports/postgres/modules/recursive_partitioning/random_forest.sql_in b/src/ports/postgres/modules/recursive_partitioning/random_forest.sql_in
index b74288a..cc228ac 100644
--- a/src/ports/postgres/modules/recursive_partitioning/random_forest.sql_in
+++ b/src/ports/postgres/modules/recursive_partitioning/random_forest.sql_in
@@ -208,13 +208,26 @@ forest_train(training_table_name,
 
     <tr>
     <th>dependent_var_levels</th>
-    <td>itext. For classification, the distinct levels of the dependent variable.</td>
+    <td>text. For classification, the distinct levels of the dependent variable.</td>
     </tr>
 
     <tr>
     <th>dependent_var_type</th>
     <td>text. The type of dependent variable.</td>
     </tr>
+
+    <tr>
+    <th>independent_var_types</th>
+    <td>text. A comma separated string for the types of independent variables.</td>
+    </tr>
+
+    <tr>
+    <th>null_proxy</th>
+    <td>text. Describes how NULLs are handled.  If NULL is not 
+    treated as a separate categorical value, this will be NULL.
+    If NULL is treated as a separate categorical value, this will be 
+    set to "__NULL__"</td>
+    </tr>
     </table>
 
     A group table named <em> \<model_table\>_group</em> is created, which has the following columns:
@@ -374,7 +387,7 @@ forest_train(training_table_name,
       variable comes into use when the primary predictior value is NULL.
     </tr>
     <tr>
-    <th>null_as_special_cat</th>
+    <th>null_as_category</th>
     <td>Default: FALSE. Whether to treat NULL as a special categorical value.
 
     If this is set to TRUE, NULL values are considered a categorical
@@ -564,7 +577,7 @@ dependent_varname     | class
 independent_varnames  | "OUTLOOK",windy,temperature,humidity
 cat_features          | "OUTLOOK",windy
 con_features          | temperature,humidity
-grouping_cols         |
+grouping_cols         | 
 num_trees             | 20
 num_random_features   | 2
 max_tree_depth        | 8
@@ -581,6 +594,7 @@ total_rows_skipped    | 0
 dependent_var_levels  | "Don't Play","Play"
 dependent_var_type    | text
 independent_var_types | text, text, double precision, double precision
+null_proxy            | None
 </pre>
 View the group table output:
 <pre class="example">
@@ -592,10 +606,10 @@ Result:
 gid                | 1
 success            | t
 cat_n_levels       | {3,2}
-cat_levels_in_text | {overcast,rain,sunny,false,true}
-oob_error          | 0.50000000000000000000
-cat_var_importance | {-0.206309523809524,-0.234345238095238}
-con_var_importance | {-0.308690476190476,-0.272678571428571}
+cat_levels_in_text | {overcast,sunny,rain,false,true}
+oob_error          | 0.42857142857142857143
+cat_var_importance | {0.0305555555555556,0.0626984126984127}
+con_var_importance | {0,0.0243650793650794}
 </pre>
 
 -# Obtain a dot format display of a single tree