Posted to commits@madlib.apache.org by ok...@apache.org on 2020/05/19 18:38:05 UTC

[madlib] branch master updated: dbscan user docs

This is an automated email from the ASF dual-hosted git repository.

okislal pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/madlib.git


The following commit(s) were added to refs/heads/master by this push:
     new 2047925  dbscan user docs
2047925 is described below

commit 204792585a04031b0cdeb3cb947bddfa072f1df8
Author: Frank McQuillan <fm...@pivotal.io>
AuthorDate: Thu May 14 17:51:05 2020 -0700

    dbscan user docs
---
 doc/mainpage.dox.in                             |   2 +-
 src/ports/postgres/modules/dbscan/dbscan.sql_in | 176 +++++++++++++++++++-----
 2 files changed, 145 insertions(+), 33 deletions(-)

diff --git a/doc/mainpage.dox.in b/doc/mainpage.dox.in
index f4906a6..a02500e 100644
--- a/doc/mainpage.dox.in
+++ b/doc/mainpage.dox.in
@@ -240,8 +240,8 @@ complete matrix stored as a distributed table.
     @brief Methods for clustering data.
     @details Methods for clustering data.
     @{
-        @defgroup grp_kmeans k-Means Clustering
         @defgroup grp_dbscan DBSCAN
+        @defgroup grp_kmeans k-Means Clustering
     @}
 
     @defgroup grp_pca Dimensionality Reduction
diff --git a/src/ports/postgres/modules/dbscan/dbscan.sql_in b/src/ports/postgres/modules/dbscan/dbscan.sql_in
index 976d065..7682efc 100644
--- a/src/ports/postgres/modules/dbscan/dbscan.sql_in
+++ b/src/ports/postgres/modules/dbscan/dbscan.sql_in
@@ -47,8 +47,9 @@ shape [1]. It places minimum requirements on domain knowledge to determine
 input parameters and has good efficiency on large databases.
 
 Given a set of points, DBSCAN groups together points that are closely packed with many
-nearby neighbors (core samples), and marks as outliers points that lie alone in
-low-density regions with few nearby neighbors (non-core samples).
+nearby neighbors (\em core), and marks points that are within distance 'eps' of a
+core point but do not themselves have enough neighbors to be core (\em border).
+Points in the dataset that are neither core nor border are considered \em noise.
 This method tends to be good for data which contains clusters
 of similar density.
 
@@ -92,8 +93,7 @@ or an expression that evaluates to an array of point coordinates.
 be considered in the neighborhood of the other. (Note this is not a maximum
 bound on the distances of points within a cluster.) This is an
 important parameter to choose appropriately and should consider both the
-nature of the data set and the distance function. Usually it is not left
-at the default value.
+nature of the data set and the distance function.
 </dd>
 
 <dt>min_samples (optional)</dt>
@@ -106,9 +106,8 @@ the magnitude of this parameter.
 
 @note
 The parameters 'eps' and 'min_samples' together define the density of a cluster.
-A core sample is a sample in the dataset such
-that there exist 'min_samples' other samples within a distance of 'eps',
-which are defined as neighbors of the core sample.
+A core point is a point that has at least 'min_samples' other points within a distance of 'eps';
+these points are defined as the neighbors of the core point.
 A higher value of 'min_samples' or a lower value of 'eps' indicates that higher density
 is needed to form a cluster.
 
@@ -126,37 +125,28 @@ performance reasons.</li>
 <li><b>\ref dist_tanimoto</b>: tanimoto (element-wise mean of normalized points)</li>
 </dd>
 
-<dt>algorithm TBD (optional)</dt>
+<dt>algorithm TBD??? (optional)</dt>
 <dd>TEXT, default: 'brute_force'. The name of the algorithm
 used to compute clusters. The following options are supported:
 <ul>
 <li><b>\ref brute_force</b>: Produces an exact result by searching
-all points in the search space.  You can also use a short
+all points in the search space.  Brute force can be slow and is intended
+to be used for small datasets only.  You can also use a short
 form "b" or "brute" etc. to select brute force.</li>
-<li><b>\ref xx_tree</b>: Produces an approximate result by searching
-a subset of the search space, that is, only certain leaf nodes in the
-xx-tree as specified by "algorithm_params" below.
+<li><b>\ref kd_tree</b>: Uses a kd-tree structure to reduce the amount
+of search required to form clusters.
+The depth of the kd-tree to search is specified
+by the "algorithm_params" argument below.
 You can also use a short
-form "x" or "xx" etc. to select xx-tree.</li></ul></dd>
+form "k" or "kd" etc. to select kd-tree.</li></ul></dd>
 
-<dt>algorithm_params TBD (optional)</dt>
-<dd>TEXT, default: 'depth=3, leaf_nodes=2'. These parameters apply to the
-xx-tree algorithm only.
-<ul>
-<li><b>\ref depth</b>: Depth of the xx-tree. Increasing this value will
-decrease run-time but reduce the accuracy.</li>
-<li><b>\ref leaf_nodes</b>: Number of leaf nodes (regions) to search for each test point.
-Inceasing this value will improve the accuracy but increase run-time.</li></ul>
-
-@note
-Please note that the xx-tree accuracy will be lower for datasets with a high
-number of features. It is advised to use at least two leaf nodes.
-Refer to the <a href="#background">Technical Background</a> for more information
-on how the xx-tree is implemented.</dd>
+<dt>algorithm_params (optional)</dt>
+<dd>TEXT, default: 'depth=3'. This parameter applies to the
+kd-tree algorithm only.  Increasing the depth of the tree will
+decrease the run-time but reduce the accuracy. TBD???</dd>
 
 </dl>
 
-<<<<<<< Updated upstream
 <b>Output</b>
 <br>
 The output table for the DBSCAN module has the following columns:
@@ -171,7 +161,10 @@ The output table for DBSCAN module has the following columns:
     </tr>
     <tr>
         <th>is_core_point</th>
-        <td>BOOLEAN. Indicates if the test data point is core or non-core.</td>
+        <td>BOOLEAN. Indicates whether the data point is a core point.
+        If it is listed in the output table but is not a core point,
+        it is a border point.  Noise points are not shown in this
+        table.</td>
     </tr>
     <tr>
         <th>points</th>
@@ -195,13 +188,13 @@ dbscan_predict( dbscan_table,
 <dt>dbscan_table</dt>
 <dd>TEXT. Name of the table created by running DBSCAN.</dd>
 
-<dt>x</dt>
+<dt>new_point</dt>
 <dd>DOUBLE PRECISION[]. The new point to be assigned to a cluster.</dd>
 </dl>
 
-<b>Output TBD</b>
+<b>Output TBD???</b>
 <br>
-The output table has the following columns:
+The output is a composite type with the following columns:
 <table class="output">
     <tr>
       <th>column_id</th>
@@ -216,6 +209,125 @@ The output table has the following columns:
 @anchor examples
 @par Examples
 
+-# Create input data:
+<pre class="example">
+DROP TABLE IF EXISTS dbscan_train_data;
+CREATE TABLE dbscan_train_data (pid int, points double precision[]);
+INSERT INTO dbscan_train_data VALUES
+(1,  '{1, 1}'),
+(2,  '{2, 1}'),
+(3,  '{1, 2}'),
+(4,  '{2, 2}'),
+(5,  '{3, 5}'),
+(6,  '{3, 9}'),
+(7,  '{3, 10}'),
+(8,  '{4, 10}'),
+(9,  '{4, 11}'),
+(10,  '{5, 10}'),
+(11,  '{7, 10}'),
+(12,  '{10, 9}'),
+(13,  '{10, 6}'),
+(14,  '{9, 5}'),
+(15,  '{10, 5}'),
+(16,  '{11, 5}'),
+(17,  '{9, 4}'),
+(18,  '{10, 4}'),
+(19,  '{11, 4}'),
+(20,  '{10, 3}');
+</pre>
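+Before running DBSCAN it can help to get a feel for the density of the data at a
+candidate 'eps' (here 1.75, the value used in the next step).  The query below is
+illustrative only and is not part of the MADlib API; it counts, for each point,
+how many other points fall within a Euclidean distance of 1.75:
+<pre class="example">
+-- Illustrative only: count other points within a candidate eps of 1.75,
+-- computing the 2-D Euclidean distance by hand.
+SELECT a.pid,
+       count(b.pid) AS neighbors_within_eps
+FROM dbscan_train_data a
+LEFT JOIN dbscan_train_data b
+  ON a.pid <> b.pid
+ AND sqrt( power(a.points[1] - b.points[1], 2)
+         + power(a.points[2] - b.points[2], 2) ) <= 1.75
+GROUP BY a.pid
+ORDER BY a.pid;
+</pre>
+Points with many neighbors within the candidate 'eps' lie in dense regions and are
+more likely to end up as core points; points with few or none are more likely to
+end up as border or noise points.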
+-#  Run DBSCAN using the brute force method with a Euclidean
+distance function:
+<pre class="example">
+DROP TABLE IF EXISTS dbscan_result, dbscan_result_summary;
+SELECT madlib.dbscan( 
+                'dbscan_train_data',    -- source table
+                'dbscan_result',        -- output table
+                'pid',                  -- point id column
+                'points',               -- data point
+                 1.75,                  -- epsilon
+                 4,                     -- min samples
+                'dist_norm2',           -- metric
+                'brute_force');         -- algorithm
+SELECT * FROM dbscan_result ORDER BY pid;
+</pre>
+<pre class="result">
+ pid | cluster_id | is_core_point | __points__ 
+-----+------------+---------------+------------
+   1 |          0 | t             | {1,1}
+   2 |          0 | t             | {2,1}
+   3 |          0 | t             | {1,2}
+   4 |          0 | t             | {2,2}
+   6 |          1 | f             | {3,9}
+   7 |          1 | t             | {3,10}
+   8 |          1 | t             | {4,10}
+   9 |          1 | t             | {4,11}
+  10 |          1 | f             | {5,10}
+  13 |          2 | t             | {10,6}
+  14 |          2 | t             | {9,5}
+  15 |          2 | t             | {10,5}
+  16 |          2 | t             | {11,5}
+  17 |          2 | t             | {9,4}
+  18 |          2 | t             | {10,4}
+  19 |          2 | t             | {11,4}
+  20 |          2 | t             | {10,3}
+(17 rows)
+</pre>
+Three clusters are created.  All points are core points
+except for points 6 and 10, which are border points.  Noise points
+do not show up in the output table. If you want to see the noise points
+you can use a query like:
+<pre class="example">
+SELECT l.* FROM dbscan_train_data l WHERE NOT EXISTS 
+    (SELECT NULL FROM dbscan_result r WHERE r.pid = l.pid) 
+    ORDER BY l.pid;
+</pre>
+<pre class="result">
+ pid | points 
+-----+--------
+   5 | {3,5}
+  11 | {7,10}
+  12 | {10,9}
+</pre>
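+You can also label every training point as core, border or noise in a single query
+by joining back to the output table.  This query is illustrative only and uses just
+the columns shown above; it is not part of the MADlib API:
+<pre class="example">
+SELECT t.pid,
+       t.points,
+       CASE WHEN r.is_core_point THEN 'core'
+            WHEN r.pid IS NOT NULL THEN 'border'
+            ELSE 'noise'
+       END AS point_type
+FROM dbscan_train_data t
+LEFT JOIN dbscan_result r
+ON t.pid = r.pid
+ORDER BY t.pid;
+</pre>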
+The summary table lists the 'eps' value and the distance metric used:
+<pre class="example">
+SELECT * FROM dbscan_result_summary;
+</pre>
+<pre class="result">
+ id_column | eps  |   metric   
+-----------+------+------------
+ pid       | 1.75 | dist_norm2
+</pre>
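+You can summarize the clustering by aggregating the output table.  This is an
+illustrative query using only the columns shown above, not part of the MADlib API:
+<pre class="example">
+SELECT cluster_id,
+       count(*) AS num_points,
+       sum(CASE WHEN is_core_point THEN 1 ELSE 0 END) AS num_core_points
+FROM dbscan_result
+GROUP BY cluster_id
+ORDER BY cluster_id;
+</pre>
+For the result above this gives:
+<pre class="result">
+ cluster_id | num_points | num_core_points
+------------+------------+-----------------
+          0 |          4 |               4
+          1 |          5 |               3
+          2 |          8 |               8
+(3 rows)
+</pre>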
+-#  Find the cluster assignment for new points.  In this example we use the same
+source points for demonstration purposes:
+<pre class="example">
+SELECT pid, madlib.dbscan_predict (
+                        'dbscan_result',   -- from DBSCAN run
+                         points)           -- data to cluster
+FROM dbscan_train_data ORDER BY pid;
+</pre>
+<pre class="result">
+TBD???
+</pre>
+-#  Now let's run DBSCAN using the kd-tree method with a Euclidean
+distance function:
+<pre class="example">
+DROP TABLE IF EXISTS dbscan_result_kd, dbscan_result_kd_summary;
+SELECT madlib.dbscan( 
+                'dbscan_train_data',    -- source table
+                'dbscan_result_kd',     -- output table
+                'pid',                  -- point id column
+                'points',               -- data point
+                 1.75,                  -- epsilon
+                 4,                     -- min samples
+                'dist_norm2',           -- metric
+                'kd_tree',              -- algorithm
+                'depth=3');             -- depth of kd-tree
+SELECT * FROM dbscan_result_kd ORDER BY pid;
+</pre>
+<pre class="result">
+TBD???
+</pre>
+
 @anchor literature
 @literature