Posted to commits@madlib.apache.org by ok...@apache.org on 2019/09/19 17:24:15 UTC
[madlib] 02/02: update user docs for auto-k and per-point silh and
generally reorganize
This is an automated email from the ASF dual-hosted git repository.
okislal pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/madlib.git
commit dae72a0c3a4bf3ba9ac3837c51190932e1d62ba3
Author: Frank McQuillan <fm...@pivotal.io>
AuthorDate: Wed Sep 18 09:00:18 2019 -0700
update user docs for auto-k and per-point silh and generally reorganize
---
src/ports/postgres/modules/kmeans/kmeans.sql_in | 739 +++++++++++++++++++-----
1 file changed, 586 insertions(+), 153 deletions(-)
diff --git a/src/ports/postgres/modules/kmeans/kmeans.sql_in b/src/ports/postgres/modules/kmeans/kmeans.sql_in
index e354e05..50f6500 100644
--- a/src/ports/postgres/modules/kmeans/kmeans.sql_in
+++ b/src/ports/postgres/modules/kmeans/kmeans.sql_in
@@ -16,18 +16,19 @@ m4_include(`SQLCommon.m4')
<div class="toc"><b>Contents</b>
<ul>
-<li class="level1"><a href="#train">Training Function</a></li>
-<li class="level1"><a href="#output">Output Format</a></li>
+<li class="level1"><a href="#cluster">Clustering Function</a></li>
+<li class="level1"><a href="#auto_cluster">Auto Clustering Function</a></li>
<li class="level1"><a href="#assignment">Cluster Assignment</a></li>
+<li class="level1"><a href="#silh">Simple Silhouette</a></li>
<li class="level1"><a href="#examples">Examples</a></li>
-<li class="level1"><a href="#notes">Notes</a></li>
<li class="level1"><a href="#background">Technical Background</a></li>
<li class="level1"><a href="#literature">Literature</a></li>
<li class="level1"><a href="#related">Related Topics</a></li>
</ul>
</div>
-@brief Partitions a set of observations into clusters by finding centroids that minimize the sum of observations' distances from their closest centroid.
+@brief Partitions a set of observations into clusters by finding centroids that
+minimize the sum of observations' distances from their closest centroid.
Clustering refers to the problem of partitioning a set of objects according to
some problem-dependent measure of <em>similarity</em>. In the k-means variant,
@@ -37,13 +38,19 @@ so that the sum of \em distances between each point and its closest centroid is
minimized. Each centroid represents a cluster that consists of all points to
which this centroid is closest.
-@anchor train
-@par Training Function
+This module can compute clusters given the number of centroids <em>k</em> as an input,
+using a variety of seeding methods.
+It can also automatically select the best <em>k</em> value from a range
+of suggested <em>k</em> values, using the simplified silhouette method
+or the elbow method.
-The k-means algorithm can be invoked in four ways, depending on the
+@anchor cluster
+@par Clustering Function
+
+The k-means algorithm can be invoked in different ways, depending on the
source of the initial set of centroids:
-- Use the random centroid seeding method.
+- Uniform random centroid seeding method.
<pre class="syntax">
kmeans_random( rel_source,
expr_point,
@@ -55,7 +62,10 @@ kmeans_random( rel_source,
)
</pre>
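As background, the uniform random seeding strategy amounts to drawing <em>k</em> distinct input points to serve as the initial centroids. A minimal Python sketch of the idea (the function name is invented for illustration; this is not the database implementation):

```python
import random

def kmeans_random_seed(points, k, rng=random.Random(0)):
    """Illustrative uniform random seeding: draw k distinct input points
    to serve as the initial centroids."""
    return rng.sample(points, k)

seeds = kmeans_random_seed([[0.0], [1.0], [10.0], [11.0]], 2)
```

Because `random.sample` draws without replacement, the two seeds are always distinct points from the input.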
-- Use the kmeans++ centroid seeding method.
+- k-means++ centroid seeding method <a href="#kmeans-lit-1">[1]</a>.
+This method can speed up convergence
+by seeding centroids spread out over the whole range of the input
+points, while at the same time not being too susceptible to outliers.
<pre class="syntax">
kmeanspp( rel_source,
expr_point,
@@ -68,7 +78,9 @@ kmeanspp( rel_source,
)
</pre>
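The D-squared weighting behind k-means++ can be pictured in a few lines: after a uniform first pick, each later centroid is drawn with probability proportional to its distance from the nearest centroid chosen so far, so far-away regions tend to contribute seeds. A hedged Python sketch, not MADlib's implementation (function names are invented for illustration):

```python
import random

def kmeanspp_seed(points, k, dist, rng=random.Random(0)):
    """Illustrative k-means++ seeding: after a uniform first pick, each
    subsequent centroid is drawn with probability proportional to its
    distance from the nearest centroid chosen so far."""
    centroids = [rng.choice(points)]
    while len(centroids) < k:
        # Weight each point by its distance to the closest existing centroid,
        # so sparse, far-away regions are likely to supply the next centroid.
        weights = [min(dist(p, c) for c in centroids) for p in points]
        centroids.append(rng.choices(points, weights=weights, k=1)[0])
    return centroids

def squared_dist_norm2(a, b):
    # Analogue of MADlib's default squared Euclidean distance.
    return sum((x - y) ** 2 for x, y in zip(a, b))

seeds = kmeanspp_seed([[0.0], [0.1], [10.0], [10.1]], 2, squared_dist_norm2)
```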
-- Supply an initial centroid set in a relation identified by the \e rel_initial_centroids argument.
+- Supply an initial centroid set in a relation identified
+by the \e rel_initial_centroids argument. Use this form when the
+initial centroids are stored in a table.
<pre class="syntax">
kmeans( rel_source,
expr_point,
@@ -81,7 +93,8 @@ kmeans( rel_source,
)
</pre>
-- Provide an initial centroid set as an array expression in the \e initial_centroids argument.
+- Provide an initial centroid set as an array expression
+in the \e initial_centroids argument.
<pre class="syntax">
kmeans( rel_source,
expr_point,
@@ -92,12 +105,12 @@ kmeans( rel_source,
min_frac_reassigned
)
</pre>
+
\b Arguments
<dl class="arglist">
<dt>rel_source</dt>
-<dd>TEXT. The name of the table containing the input data points.
-
-Data points and predefined centroids (if used) are expected to be stored row-wise,
+<dd>TEXT. The name of the table containing the input data points. Data points and
+predefined centroids (if used) are expected to be stored row-wise,
in a column of type <tt>\ref grp_svec "SVEC"</tt> (or any type convertible to
<tt>\ref grp_svec "SVEC"</tt>, like <tt>FLOAT[]</tt> or <tt>INTEGER[]</tt>).
Data points with non-finite values (NULL, NaN, infinity) in any component
@@ -111,13 +124,13 @@ are skipped during analysis.
<dd>INTEGER. The number of centroids to calculate.</dd>
<dt>fn_dist (optional)</dt>
-<dd>TEXT, default: squared_dist_norm2'. The name of the function to use to calculate the distance from a data point to a centroid.
-
+<dd>TEXT, default: 'squared_dist_norm2'. The name of the function
+to use to calculate the distance from a data point vector to a centroid vector.
The following distance functions can be used (computation of barycenter/mean in parentheses):
<ul>
-<li><b>\ref dist_norm1</b>: 1-norm/Manhattan (element-wise median
-[Note that MADlib does not provide a median aggregate function for support and
-performance reasons.])</li>
+<li><b>\ref dist_norm1</b>: 1-norm/Manhattan (element-wise median).
+MADlib does not provide a median aggregate function for
+performance reasons.</li>
<li><b>\ref dist_norm2</b>: 2-norm/Euclidean (element-wise mean)</li>
<li><b>\ref squared_dist_norm2</b>: squared Euclidean distance (element-wise mean)</li>
<li><b>\ref dist_angle</b>: angle (element-wise mean of normalized points)</li>
@@ -126,9 +139,8 @@ performance reasons.])</li>
<dt>agg_centroid (optional)</dt>
<dd>TEXT, default: 'avg'. The name of the aggregate function used to determine centroids.
-
The following aggregate functions can be used:<ul>
- <li><b>\ref avg</b>: average (Default)</li>
+ <li><b>\ref avg</b>: average (default)</li>
<li><b>\ref normalized_avg</b>: normalized average</li></ul></dd>
<dt>max_num_iterations (optional)</dt>
@@ -143,28 +155,133 @@ When fewer than this fraction of centroids are reassigned in an iteration, the c
to use for kmeans++ centroid seeding method. Kmeans++ scans through the data
sequentially 'k' times and can be too slow for big datasets. When
'seeding_sample_ratio' is greater than 0 (thresholded to be maximum value of 1.0),
-the seeding is run on an uniform random subsample of the data.
+the seeding is run on a uniform random subsample of the data.
Note: the final K-means algorithm is run on the complete dataset. This parameter
only builds a subsample for the seeding and is only available for kmeans++.
<dt>rel_initial_centroids</dt>
-<dd>TEXT. The set of initial centroids.
+<dd>TEXT. Table or view containing the set of initial centroids.
</dd>
<dt>expr_centroid</dt>
-<dd>TEXT. The name of the column (or the array expression) in the <em>rel_initial_centroids</em> relation that contains the centroid coordinates.</dd>
+<dd>TEXT. The name of the column (or the array expression)
+in the <em>rel_initial_centroids</em> table or view that contains
+the centroid coordinates.</dd>
<dt>initial_centroids</dt>
-<dd>TEXT. A string containing a DOUBLE PRECISION array expression with the initial centroid coordinates.</dd>
+<dd>DOUBLE PRECISION[][]. Array expression
+with the initial centroid coordinates.</dd>
</dl>
+<b>Output</b>
+<br>
+The output of the k-means module is a composite type with
+the following columns:
+<table class="output">
+ <tr>
+ <th>centroids</th>
+ <td>DOUBLE PRECISION[][]. The final centroid positions.</td>
+ </tr>
+ <tr>
+ <th>cluster_variance</th>
+ <td>DOUBLE PRECISION[]. The value of the objective function per cluster.</td>
+ </tr>
+ <tr>
+ <th>objective_fn</th>
+ <td>DOUBLE PRECISION. The value of the objective function.</td>
+ </tr>
+ <tr>
+ <th>frac_reassigned</th>
+ <td>DOUBLE PRECISION. The fraction of points reassigned in the last iteration.</td>
+ </tr>
+ <tr>
+ <th>num_iterations</th>
+ <td>INTEGER. The total number of iterations executed.</td>
+ </tr>
+</table>
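The interplay of \e max_num_iterations and \e min_frac_reassigned with the reported output columns can be sketched as a plain Lloyd iteration. This is an illustrative Python sketch under the 'avg' (element-wise mean) aggregate, not the database implementation:

```python
def squared_dist_norm2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def lloyd(points, centroids, dist=squared_dist_norm2,
          max_num_iterations=20, min_frac_reassigned=0.001):
    """Illustrative Lloyd iteration: reassign points to their closest
    centroid, move each centroid to the mean of its cluster ('avg'),
    and stop once at most min_frac_reassigned of the points changed
    cluster or max_num_iterations is reached."""
    assignments = [None] * len(points)
    num_iterations = 0
    frac_reassigned = 1.0
    while (num_iterations < max_num_iterations
           and frac_reassigned > min_frac_reassigned):
        num_iterations += 1
        new = [min(range(len(centroids)), key=lambda j: dist(p, centroids[j]))
               for p in points]
        frac_reassigned = sum(a != b for a, b in zip(assignments, new)) / len(points)
        assignments = new
        for j in range(len(centroids)):
            members = [p for p, a in zip(points, assignments) if a == j]
            if members:  # an empty cluster keeps its centroid where it is
                centroids[j] = [sum(c) / len(members) for c in zip(*members)]
    return centroids, assignments, frac_reassigned, num_iterations

cents, assign, frac, iters = lloyd([[0.0], [1.0], [10.0], [11.0]],
                                   [[0.0], [10.0]])
```

With the well-separated toy input above, the loop settles after the second pass with no points reassigned, mirroring the `frac_reassigned` and `num_iterations` columns of the output.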
+
+@anchor auto_cluster
+@par Auto Clustering Function
-@anchor output
-@par Output Format
+The optimal number of centroids can be determined automatically,
+from a set of candidate values that you provide, for
+random seeding or <em>k-means++</em> seeding. The <a href="#simplified_silhouette">simplified
+silhouette method</a> or the <a href="#elbow">elbow method</a> is used to pick the best <em>k</em> value.
-The output of the k-means module is a composite type with the following columns:
+- Uniform random centroid seeding method.
+<pre class="syntax">
+kmeans_random_auto( rel_source,
+ output_table,
+ expr_point,
+ k,
+ fn_dist,
+ agg_centroid,
+ max_num_iterations,
+ min_frac_reassigned,
+ k_selection_algorithm
+ )
+</pre>
+
+- k-means++ centroid seeding method.
+<pre class="syntax">
+kmeanspp_auto( rel_source,
+ output_table,
+ expr_point,
+ k,
+ fn_dist,
+ agg_centroid,
+ max_num_iterations,
+ min_frac_reassigned,
+ seeding_sample_ratio,
+ k_selection_algorithm
+ )
+</pre>
+
+<b>Arguments</b>
+<br>
+The arguments for auto k selection are the same as described above,
+with the following additions:
+<dl class="arglist">
+
+<dt>output_table</dt>
+<dd>TEXT. Name of the output table containing results for each k
+value. Details of the output table are shown below.
+A summary table called \<output_table\>_summary will also be
+created for the best k value as per the selection algorithm.</dd>
+
+<dt>k</dt>
+<dd>INTEGER[]. Array of k values to test. The values do not need to be
+contiguous, but each must be greater than 1 and must not repeat
+within the array. The parameter 'k_selection_algorithm'
+determines the evaluation method.</dd>
+
+<dt>k_selection_algorithm (optional)</dt>
+<dd>TEXT, default: 'silhouette'. Method used to evaluate the optimal number of centroids k.
+Currently supported approaches: 'silhouette', 'elbow' or 'both'.
+Any prefix of these strings is accepted; e.g., 'silh' will
+use the silhouette method.
+Note that for large data sets, the silhouette computation
+can be expensive.</dd>
+
+@note
+The auto k-means algorithms require the NumPy Python package to be
+installed on the system, since the elbow method uses the NumPy gradient function <a href="#kmeans-lit-2">[2]</a>.
+For Greenplum clusters, installing NumPy only on the
+master host is sufficient. The suggested
+installation method is: <em>pip install numpy --user</em>
+</dl>
+
+<b>Output Tables</b>
+<br>
+Two output tables are created for auto k-means.
+The first is called 'output_table' and contains one row
+per k value:
<table class="output">
<tr>
+ <th>k</th>
+ <td>INTEGER. Number of centroids.</td>
+ </tr>
+ <tr>
<th>centroids</th>
<td>DOUBLE PRECISION[][]. The final centroid positions.</td>
</tr>
@@ -184,44 +301,188 @@ The output of the k-means module is a composite type with the following columns:
<th>num_iterations</th>
<td>INTEGER. The total number of iterations executed.</td>
</tr>
+ <tr>
+ <th>silhouette</th>
+ <td>DOUBLE PRECISION. The value of the simplified silhouette
+ score for the k value, if computed.</td>
+ </tr>
+ <tr>
+ <th>elbow</th>
+ <td>DOUBLE PRECISION. The value of the elbow
+ score for the k value, if computed.</td>
+ </tr>
+ <tr>
+ <th>selection_algorithm</th>
+ <td>TEXT. Algorithm used to select the best
+ k (either 'silhouette' or 'elbow')</td>
+ </tr>
</table>
+A summary table named \<output_table\>_summary is also created. It has the same columns as
+'output_table' above but contains only one row, for the best k value as per the selection algorithm.
+If 'both' is specified for 'k_selection_algorithm',
+the best k value returned will correspond to the silhouette method.
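The selection logic can be pictured as: cluster once per candidate k, score each clustering, and keep the best k. A hedged Python sketch (not MADlib's code; the score values are hypothetical, and a discrete second difference stands in for the NumPy gradient computation):

```python
def pick_best_k(ks, objective_fn, silhouette, k_selection_algorithm='silhouette'):
    """Illustrative sketch of best-k selection. 'silhouette' keeps the
    candidate with the highest silhouette score; 'elbow' keeps the
    interior candidate where the objective-vs-k curve bends the most,
    measured here with a discrete second difference in place of
    NumPy's gradient."""
    if k_selection_algorithm == 'silhouette':
        return max(ks, key=lambda k: silhouette[k])
    # Curvature of the objective curve, defined for interior k values only.
    curvature = {ks[i]: objective_fn[ks[i - 1]] - 2 * objective_fn[ks[i]]
                        + objective_fn[ks[i + 1]]
                 for i in range(1, len(ks) - 1)}
    return max(curvature, key=curvature.get)

# Hypothetical per-k scores for candidates 2..6.
sil = {2: 0.87, 3: 0.90, 4: 0.85, 5: 0.80, 6: 0.75}
obj = {2: 151185.0, 3: 99043.0, 4: 60000.0, 5: 40000.0, 6: 8536.0}
best_by_silh = pick_best_k([2, 3, 4, 5, 6], obj, sil)
best_by_elbow = pick_best_k([2, 3, 4, 5, 6], obj, sil, 'elbow')
```

Note how the two methods can disagree on these made-up scores: the silhouette picks the score peak, while the elbow picks the sharpest bend in the objective curve.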
@anchor assignment
@par Cluster Assignment
-After training, the cluster assignment for each data point can be computed with
-the help of the following function:
+After clustering, the cluster assignment for each data point can be computed with
+the help of the closest_column() utility function:
<pre class="syntax">
-closest_column( m, x )
+closest_column( m,
+ x,
+ dist
+ )
</pre>
-<b>Argument</b>
+<b>Arguments</b>
<dl class="arglist">
<dt>m</dt>
-<dd>DOUBLE PRECISION[][]. The learned centroids from the training function.</dd>
+<dd>DOUBLE PRECISION[][]. Learned centroids from the training function.</dd>
+
<dt>x</dt>
-<dd>DOUBLE PRECISION[]. The data point.</dd>
+<dd>DOUBLE PRECISION[]. Data points.</dd>
+
+<dt>dist (optional)</dt>
+<dd>TEXT, default: 'squared_dist_norm2'. The name of the function
+to use to calculate the distance from a data point vector to a centroid vector.
+See the \e fn_dist argument of the k-means training function for more details on distance functions.</dd>
</dl>
-<b>Output format</b>
+<b>Output</b>
+<br>
+The output is a composite type with
+the following columns:
<table class="output">
-<tr>
- <th>column_id</th>
- <td>INTEGER. The cluster assignment (zero-based).</td>
-<tr>
- <th>distance</th>
- <td>DOUBLE PRECISION. The distance to the cluster centroid.</td>
+ <tr>
+ <th>column_id</th>
+ <td>INTEGER. Cluster assignment (zero-based, i.e., 0,1,2...).</td>
+ </tr>
+ <tr>
+ <th>distance</th>
+ <td>DOUBLE PRECISION. Distance to the cluster centroid.</td>
+ </tr>
+</table>
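Conceptually, closest_column() is an argmin over the rows of the centroid matrix. A minimal Python sketch of the same idea (illustrative only, using the squared Euclidean default):

```python
def squared_dist_norm2(a, b):
    return sum((u - v) ** 2 for u, v in zip(a, b))

def closest_column(m, x, dist=squared_dist_norm2):
    """Illustrative analogue of closest_column(): return the zero-based
    index of the row of m nearest to x, plus that distance."""
    column_id = min(range(len(m)), key=lambda i: dist(m[i], x))
    return column_id, dist(m[column_id], x)

cid, d = closest_column([[0.0, 0.0], [10.0, 10.0]], [9.0, 9.0])
# cid == 1, d == 2.0
```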
+
+@anchor silh
+@par Simple Silhouette
+
+A common method to assess the quality of the clustering is the <em>silhouette
+coefficient</em>, a simplified version of which is
+implemented in MADlib <a href="#kmeans-lit-3">[3]</a>.
+There are two silhouette functions: average score across all data points,
+and a per-point score. The average silhouette function has the following syntax:
+<pre class="syntax">
+simple_silhouette( rel_source,
+ expr_point,
+ centroids,
+ fn_dist
+ )
+</pre>
+
+The per-point silhouette function has two formats. The first
+takes an array of centroids:
+<pre class="syntax">
+simple_silhouette_points(rel_source,
+ output_table,
+ pid,
+ expr_point,
+ centroids,
+ fn_dist
+ )
+</pre>
+
+The second takes a centroids column from a table:
+<pre class="syntax">
+simple_silhouette_points(rel_source,
+ output_table,
+ pid,
+ expr_point,
+ centroids_table,
+ centroids_col,
+ fn_dist
+ )
+</pre>
+
+\b Arguments
+<dl class="arglist">
+<dt>rel_source</dt>
+<dd>TEXT. The name of the table containing the input data points.</dd>
+
+<dt>expr_point</dt>
+<dd>TEXT. The name of the column with point coordinates or an array expression.</dd>
+
+<dt>centroids</dt>
+<dd>DOUBLE PRECISION[][]. An expression evaluating to an array of centroids. </dd>
+
+<dt>fn_dist (optional)</dt>
+<dd>TEXT, default: 'squared_dist_norm2'. The name of the function
+to use to calculate the distance from a data point vector to a centroid vector.
+See the \e fn_dist argument of the k-means training function for more details on distance functions.</dd>
+
+<dt>output_table</dt>
+<dd>TEXT. Name of the output table that will contain the silhouette score
+for each point.</dd>
+
+<dt>pid</dt>
+<dd>TEXT. Name of the column containing the point IDs in the
+input table 'rel_source'.</dd>
+
+<dt>centroids_table</dt>
+<dd>TEXT. Name of the table containing the centroids.</dd>
+
+<dt>centroids_col</dt>
+<dd>TEXT. Name of the column in the table 'centroids_table'
+containing the centroids.</dd>
+
+</dl>
+
+<b>Output</b>
+<br>
+For the average function, a single value, the average simple silhouette
+score over all points, is returned. For the per-point function, the output table contains
+the following columns:
+<table class="output">
+ <tr>
+ <th>pid</th>
+ <td>INTEGER. Point ID.</td>
+ </tr>
+
+ <tr>
+ <th>centroid_id</th>
+ <td>INTEGER. The cluster that the point is assigned to.</td>
+ </tr>
+
+ <tr>
+ <th>neighbor_centroid_id</th>
+ <td>INTEGER. The next closest cluster to the one that the point is assigned to.</td>
+ </tr>
+
+  <tr>
+    <th>silh</th>
+    <td>DOUBLE PRECISION. Simplified silhouette score for the point.</td>
+  </tr>
</table>
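The per-point score follows the simplified silhouette formula: with <em>a</em> the distance to the closest centroid and <em>b</em> the distance to the next closest, the score is (b - a) / max(a, b). A hedged Python sketch of that formula (illustrative only; MADlib's exact variant may differ in details):

```python
def squared_dist_norm2(a, b):
    return sum((u - v) ** 2 for u, v in zip(a, b))

def simple_silhouette_point(x, centroids, dist=squared_dist_norm2):
    """Illustrative per-point simplified silhouette: a is the distance to
    the closest centroid, b the distance to the next closest, and the
    score is (b - a) / max(a, b). Values near 1 mean the point sits well
    inside its own cluster; values near 0 mean it lies between clusters."""
    ranked = sorted((dist(x, c), i) for i, c in enumerate(centroids))
    (a, centroid_id), (b, neighbor_centroid_id) = ranked[0], ranked[1]
    silh = 0.0 if max(a, b) == 0 else (b - a) / max(a, b)
    return centroid_id, neighbor_centroid_id, silh

cid, nid, s = simple_silhouette_point([1.0], [[0.0], [10.0]])
# a = 1.0 and b = 81.0, so s = 80/81
```

Averaging this score over all points gives the single value that the simple_silhouette() function returns.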
@anchor examples
@examp
-Note: Your results may not be exactly the same as below due to the nature of the
-k-means algorithm.
+@note
+Your results may not be exactly the same as below due to the random
+nature of the k-means algorithm. Also, remember to use consistent distance
+functions across the k-means, silhouette and helper functions.
+
+<h4>Clustering for Single <em>k</em> Value</h4>
--# Prepare some input data:
+-# Create input data:
<pre class="example">
DROP TABLE IF EXISTS km_sample;
CREATE TABLE km_sample(pid int, points double precision[]);
@@ -237,64 +498,102 @@ INSERT INTO km_sample VALUES
(9, '{14.83, 1.64, 2.17, 14, 97, 2.8, 2.98, 0.29, 1.98, 5.2, 1.08, 2.85, 1045}'),
(10, '{13.86, 1.35, 2.27, 16, 98, 2.98, 3.15, 0.22, 1.8500, 7.2199, 1.01, 3.55, 1045}');
</pre>
--# Run k-means clustering using kmeans++ for centroid seeding:
+-# Run k-means clustering using kmeans++ for centroid seeding. Use squared Euclidean
+distance, which is a commonly used distance function.
<pre class="example">
DROP TABLE IF EXISTS km_result;
--- Run kmeans algorithm
CREATE TABLE km_result AS
-SELECT * FROM madlib.kmeanspp('km_sample', 'points', 2,
- 'madlib.squared_dist_norm2',
- 'madlib.avg', 20, 0.001);
+SELECT * FROM madlib.kmeanspp( 'km_sample', -- Table of source data
+                               'points', -- Column containing point coordinates
+ 2, -- Number of centroids to calculate
+ 'madlib.squared_dist_norm2', -- Distance function
+ 'madlib.avg', -- Aggregate function
+                               20, -- Max number of iterations
+ 0.001 -- Fraction of centroids reassigned to keep iterating
+ );
\\x on;
SELECT * FROM km_result;
</pre>
-Result:
<pre class="result">
-centroids | {{14.036,2.018,2.536,16.56,108.6,3.004,3.03,0.298,2.038,6.10598,1.004,3.326,1340},{13.872,1.814,2.376,15.56,88.2,2.806,2.928,0.288,1.844,5.35198,1.044,3.348,988}}
-cluster_variance | {60672.638245208,90512.324426408}
-objective_fn | 151184.962671616
+-[ RECORD 1 ]----+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
+centroids | {{14.255,1.9325,2.5025,16.05,110.5,3.055,2.9775,0.2975,1.845,6.2125,0.9975,3.365,1378.75},{13.7533333333333,1.905,2.425,16.0666666666667,90.3333333333333,2.805,2.98,0.29,2.005,5.40663333333333,1.04166666666667,3.31833333333333,1020.83333333333}}
+cluster_variance | {30561.74805,122999.110416013}
+objective_fn | 153560.858466013
frac_reassigned | 0
num_iterations | 2
</pre>
--# Calculate the simplified silhouette coefficient:
+-# Simplified silhouette coefficient. First find the average
+for the whole data set. Make sure to use the same distance function as k-means above.
<pre class="example">
-SELECT * FROM madlib.simple_silhouette( 'km_sample',
- 'points',
- (SELECT centroids FROM km_result),
- 'madlib.dist_norm2'
+SELECT * FROM madlib.simple_silhouette( 'km_sample', -- Input points table
+ 'points', -- Column containing points
+ (SELECT centroids FROM km_result), -- Centroids
+ 'madlib.squared_dist_norm2' -- Distance function
);
</pre>
-Result:
<pre class="result">
-simple_silhouette | 0.68978804882941
+-[ RECORD 1 ]-----+------------------
+simple_silhouette | 0.868174608939623
+</pre>
+Now calculate the simplified silhouette coefficient for each point in the data set:
+<pre class="example">
+DROP TABLE IF EXISTS km_points_silh;
+SELECT * FROM madlib.simple_silhouette_points( 'km_sample', -- Input points table
+ 'km_points_silh', -- Output table
+ 'pid', -- Point ID column in input table
+ 'points', -- Points column in input table
+ 'km_result', -- Centroids table
+ 'centroids', -- Column in centroids table containing centroids
+ 'madlib.squared_dist_norm2' -- Distance function
+ );
+SELECT * FROM km_points_silh ORDER BY pid;
+</pre>
+<pre class="result">
+ pid | centroid_id | neighbor_centroid_id | silh
+-----+-------------+----------------------+-------------------
+ 1 | 1 | 0 | 0.966608819821713
+ 2 | 1 | 0 | 0.926251125077039
+ 3 | 1 | 0 | 0.28073008848306
+ 4 | 0 | 1 | 0.951447083910869
+ 5 | 1 | 0 | 0.80098239014753
+ 6 | 0 | 1 | 0.972487557020722
+ 7 | 0 | 1 | 0.88838568723116
+ 8 | 0 | 1 | 0.906334719972002
+ 9 | 1 | 0 | 0.994315140692314
+ 10 | 1 | 0 | 0.99420347703982
+(10 rows)
</pre>
--# Find the cluster assignment for each point:
+-# Find the cluster assignment for each point.
+Use the closest_column() function to map each point to the cluster that it belongs to.
<pre class="example">
\\x off;
--- Get point assignment
-SELECT data.*, (madlib.closest_column(centroids, points)).column_id as cluster_id
-FROM km_sample as data, km_result
-ORDER BY data.pid;
+DROP TABLE IF EXISTS point_cluster_map;
+CREATE TABLE point_cluster_map AS
+SELECT data.*, (madlib.closest_column(centroids, points, 'madlib.squared_dist_norm2')).*
+FROM km_sample as data, km_result;
+ALTER TABLE point_cluster_map RENAME column_id to cluster_id; -- change column name
+SELECT * FROM point_cluster_map ORDER BY pid;
</pre>
-Result:
<pre class="result">
- pid | points | cluster_id
------+--------------------------------------------------------------------+------------
- 1 | {14.23,1.71,2.43,15.6,127,2.8,3.06,0.28,2.29,5.64,1.04,3.92,1065} | 1
- 2 | {13.2,1.78,2.14,11.2,1,2.65,2.76,0.26,1.28,4.38,1.05,3.49,1050} | 1
- 3 | {13.16,2.36,2.67,18.6,101,2.8,3.24,0.3,2.81,5.6799,1.03,3.17,1185} | 0
- 4 | {14.37,1.95,2.5,16.8,113,3.85,3.49,0.24,2.18,7.8,0.86,3.45,1480} | 0
- 5 | {13.24,2.59,2.87,21,118,2.8,2.69,0.39,1.82,4.32,1.04,2.93,735} | 1
- 6 | {14.2,1.76,2.45,15.2,112,3.27,3.39,0.34,1.97,6.75,1.05,2.85,1450} | 0
- 7 | {14.39,1.87,2.45,14.6,96,2.5,2.52,0.3,1.98,5.25,1.02,3.58,1290} | 0
- 8 | {14.06,2.15,2.61,17.6,121,2.6,2.51,0.31,1.25,5.05,1.06,3.58,1295} | 0
- 9 | {14.83,1.64,2.17,14,97,2.8,2.98,0.29,1.98,5.2,1.08,2.85,1045} | 1
- 10 | {13.86,1.35,2.27,16,98,2.98,3.15,0.22,1.85,7.2199,1.01,3.55,1045} | 1
+ pid | points | cluster_id | distance
+-----+--------------------------------------------------------------------+------------+------------------
+ 1 | {14.23,1.71,2.43,15.6,127,2.8,3.06,0.28,2.29,5.64,1.04,3.92,1065} | 1 | 3296.12614333444
+ 2 | {13.2,1.78,2.14,11.2,1,2.65,2.76,0.26,1.28,4.38,1.05,3.49,1050} | 1 | 8856.90882600111
+ 3 | {13.16,2.36,2.67,18.6,101,2.8,3.24,0.3,2.81,5.6799,1.03,3.17,1185} | 1 | 27072.3216580044
+ 4 | {14.37,1.95,2.5,16.8,113,3.85,3.49,0.24,2.18,7.8,0.86,3.45,1480} | 0 | 10261.9450375
+ 5 | {13.24,2.59,2.87,21,118,2.8,2.69,0.39,1.82,4.32,1.04,2.93,735} | 1 | 82492.8673553345
+ 6 | {14.2,1.76,2.45,15.2,112,3.27,3.39,0.34,1.97,6.75,1.05,2.85,1450} | 0 | 5080.3612375
+ 7 | {14.39,1.87,2.45,14.6,96,2.5,2.52,0.3,1.98,5.25,1.02,3.58,1290} | 0 | 8090.4485875
+ 8 | {14.06,2.15,2.61,17.6,121,2.6,2.51,0.31,1.25,5.05,1.06,3.58,1295} | 0 | 7128.9931875
+ 9 | {14.83,1.64,2.17,14,97,2.8,2.98,0.29,1.98,5.2,1.08,2.85,1045} | 1 | 634.301947334443
+ 10 | {13.86,1.35,2.27,16,98,2.98,3.15,0.22,1.85,7.2199,1.01,3.55,1045} | 1 | 646.584486004443
(10 rows)
</pre>
--# Unnest the cluster centroids 2-D array to get a set of 1-D centroid arrays:
+-# Display cluster IDs. If you need to associate a cluster ID
+with its centroid coordinates, two steps are required. First, unnest the cluster
+centroids 2-D array to get a set of 1-D centroid arrays:
<pre class="example">
DROP TABLE IF EXISTS km_centroids_unnest;
-- Run unnest function
@@ -303,31 +602,24 @@ SELECT (madlib.array_unnest_2d_to_1d(centroids)).*
FROM km_result;
SELECT * FROM km_centroids_unnest ORDER BY 1;
</pre>
-Result:
<pre class="result">
- unnest_row_id | unnest_result
----------------+----------------------------------------------------------------------------------
- 1 | {14.036,2.018,2.536,16.56,108.6,3.004,3.03,0.298,2.038,6.10598,1.004,3.326,1340}
- 2 | {13.872,1.814,2.376,15.56,88.2,2.806,2.928,0.288,1.844,5.35198,1.044,3.348,988}
+ unnest_row_id | unnest_result
+---------------+------------------------------------------------------------------------------------------------------------------------------------------------------------
+ 1 | {14.255,1.9325,2.5025,16.05,110.5,3.055,2.9775,0.2975,1.845,6.2125,0.9975,3.365,1378.75}
+ 2 | {13.7533333333333,1.905,2.425,16.0666666666667,90.3333333333333,2.805,2.98,0.29,2.005,5.40663333333333,1.04166666666667,3.31833333333333,1020.83333333333}
(2 rows)
</pre>
-Note that the ID column returned by array_unnest_2d_to_1d()
-is not guaranteed to be the same as the cluster ID assigned by k-means.
-See below to create the correct cluster IDs.
-
--# Create cluster IDs for 1-D centroid arrays so that cluster ID for any centroid
-can be matched to the cluster assignment for the data points:
+Note that the ID column returned by array_unnest_2d_to_1d() is just a row ID and not the cluster ID assigned by k-means. The second step to get the cluster_id is:
<pre class="example">
-SELECT cent.*, (madlib.closest_column(centroids, unnest_result)).column_id as cluster_id
+SELECT cent.*, (madlib.closest_column(centroids, unnest_result, 'madlib.squared_dist_norm2')).column_id as cluster_id
FROM km_centroids_unnest as cent, km_result
ORDER BY cent.unnest_row_id;
</pre>
-Result:
<pre class="result">
- unnest_row_id | unnest_result | cluster_id
----------------+----------------------------------------------------------------------------------+------------
- 1 | {14.036,2.018,2.536,16.56,108.6,3.004,3.03,0.298,2.038,6.10598,1.004,3.326,1340} | 0
- 2 | {13.872,1.814,2.376,15.56,88.2,2.806,2.928,0.288,1.844,5.35198,1.044,3.348,988} | 1
+ unnest_row_id | unnest_result | cluster_id
+---------------+------------------------------------------------------------------------------------------------------------------------------------------------------------+------------
+ 1 | {14.255,1.9325,2.5025,16.05,110.5,3.055,2.9775,0.2975,1.845,6.2125,0.9975,3.365,1378.75} | 0
+ 2 | {13.7533333333333,1.905,2.425,16.0666666666667,90.3333333333333,2.805,2.98,0.29,2.005,5.40663333333333,1.04166666666667,3.31833333333333,1020.83333333333} | 1
(2 rows)
</pre>
@@ -360,24 +652,25 @@ INSERT INTO km_arrayin VALUES
(9, 14.83, 1.64, 2.17, 14, 97, 2.8, 2.98, 0.29, 1.98, 5.2, 1.08, 2.85, 1045),
(10, 13.86, 1.35, 2.27, 16, 98, 2.98, 3.15, 0.22, 1.8500, 7.2199, 1.01, 3.55, 1045);
</pre>
-Now find the cluster assignment for each point:
+Now find the cluster assignment for each point using random seeding:
<pre class="example">
-DROP TABLE IF EXISTS km_result;
+DROP TABLE IF EXISTS km_result_array;
-- Run kmeans algorithm
-CREATE TABLE km_result AS
-SELECT * FROM madlib.kmeans_random('km_arrayin',
- 'ARRAY[p1, p2, p3, p4, p5, p6,
+CREATE TABLE km_result_array AS
+SELECT * FROM madlib.kmeans_random('km_arrayin', -- Table of source data
+                                   'ARRAY[p1, p2, p3, p4, p5, p6,
p7, p8, p9, p10, p11, p12, p13]',
- 2,
- 'madlib.squared_dist_norm2',
- 'madlib.avg',
- 20,
- 0.001);
+ 2, -- Number of centroids to calculate
+ 'madlib.squared_dist_norm2', -- Distance function
+ 'madlib.avg', -- Aggregate function
+                               20, -- Max number of iterations
+ 0.001); -- Fraction of centroids reassigned to keep iterating
-- Get point assignment
SELECT data.*, (madlib.closest_column(centroids,
ARRAY[p1, p2, p3, p4, p5, p6,
- p7, p8, p9, p10, p11, p12, p13])).column_id as cluster_id
-FROM km_arrayin as data, km_result
+ p7, p8, p9, p10, p11, p12, p13],
+ 'madlib.squared_dist_norm2')).column_id as cluster_id
+FROM km_arrayin as data, km_result_array
ORDER BY data.pid;
</pre>
This produces the result in column format:
@@ -386,7 +679,7 @@ This produces the result in column format:
-----+-------+------+------+------+-----+------+------+------+------+--------+------+------+------+------------
1 | 14.23 | 1.71 | 2.43 | 15.6 | 127 | 2.8 | 3.06 | 0.28 | 2.29 | 5.64 | 1.04 | 3.92 | 1065 | 0
2 | 13.2 | 1.78 | 2.14 | 11.2 | 1 | 2.65 | 2.76 | 0.26 | 1.28 | 4.38 | 1.05 | 3.49 | 1050 | 0
- 3 | 13.16 | 2.36 | 2.67 | 18.6 | 101 | 2.8 | 3.24 | 0.3 | 2.81 | 5.6799 | 1.03 | 3.17 | 1185 | 0
+ 3 | 13.16 | 2.36 | 2.67 | 18.6 | 101 | 2.8 | 3.24 | 0.3 | 2.81 | 5.6799 | 1.03 | 3.17 | 1185 | 1
4 | 14.37 | 1.95 | 2.5 | 16.8 | 113 | 3.85 | 3.49 | 0.24 | 2.18 | 7.8 | 0.86 | 3.45 | 1480 | 1
5 | 13.24 | 2.59 | 2.87 | 21 | 118 | 2.8 | 2.69 | 0.39 | 1.82 | 4.32 | 1.04 | 2.93 | 735 | 0
6 | 14.2 | 1.76 | 2.45 | 15.2 | 112 | 3.27 | 3.39 | 0.34 | 1.97 | 6.75 | 1.05 | 2.85 | 1450 | 1
@@ -397,79 +690,202 @@ This produces the result in column format:
(10 rows)
</pre>
-@anchor notes
-@par Notes
-
-The algorithm stops when one of the following conditions is met:
-- The fraction of updated points is smaller than the convergence threshold
-(<em>min_frac_reassigned</em> argument). (Default: 0.001).
-- The algorithm reaches the maximum number of allowed iterations
-(<em>max_num_iterations</em> argument). (Default: 20).
+<h4>Auto Clustering for Range of <em>k</em> Values</h4>
-A popular method to assess the quality of the clustering is the <em>silhouette
-coefficient</em>, a simplified version of which is provided as part of the
-k-means module. Note that for large data sets, this computation is expensive.
-
-The silhouette function has the following syntax:
-<pre class="syntax">
-simple_silhouette( rel_source,
- expr_point,
- centroids,
- fn_dist
- )
+-# Auto k selection. Now let's run kmeans_random for a variety of k values
+and compare them using the simple silhouette and elbow methods.
+<pre class="example">
+DROP TABLE IF EXISTS km_result_auto, km_result_auto_summary;
+SELECT madlib.kmeans_random_auto(
+ 'km_sample', -- points table
+ 'km_result_auto', -- output table
+ 'points', -- column name in point table
+ ARRAY[2,3,4,5,6], -- k values to try
+ 'madlib.squared_dist_norm2', -- distance function
+ 'madlib.avg', -- aggregate function
+ 20, -- max iterations
+ 0.001, -- minimum fraction of centroids reassigned to continue iterating
+ 'both' -- k selection algorithm (simple silhouette and elbow)
+);
+\\x on
+SELECT * FROM km_result_auto_summary;
+</pre>
+<pre class="result">-[ RECORD 1 ]-------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- [...]
+k | 6
+centroids | {{14.23,1.71,2.43,15.6,127,2.8,3.06,0.28,2.29,5.64,1.04,3.92,1065},{13.24,2.59,2.87,21,118,2.8,2.69,0.39,1.82,4.32,1.04,2.93,735},{14.285,1.855,2.475,16,112.5,3.56,3.44,0.29,2.075,7.275,0.955,3.15,1465},{13.87,2.12666666666667,2.57666666666667,16.9333333333333,106,2.63333333333333,2.75666666666667,0.303333333333333,2.01333333333333,5.32663333333333,1.03666666666667,3.44333333333333,1256.66666666667},{14.345,1.495,2.22,15,97.5,2.89,3.065,0.255,1.915,6.20995,1.045,3.2 [...]
+cluster_variance | {0,0,452.7633,8078.22646267333,5.346498005,0}
+objective_fn | 8536.33626067833
+frac_reassigned | 0
+num_iterations | 3
+silhouette | 0.954496117675027
+elbow | -50296.3677976633
+selection_algorithm | silhouette
+</pre>
+The best selection above is made by the silhouette algorithm by default.
+Note that the elbow method may select a different k value as the best.
+To see results for all k values:
+<pre class="example">
+SELECT * FROM km_result_auto ORDER BY k;
+</pre>
+<pre class="result">
+-[ RECORD 1 ]----+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ [...]
+k | 2
+centroids | {{14.036,2.018,2.536,16.56,108.6,3.004,3.03,0.298,2.038,6.10598,1.004,3.326,1340},{13.872,1.814,2.376,15.56,88.2,2.806,2.928,0.288,1.844,5.35198,1.044,3.348,988}}
+cluster_variance | {60672.638245208,90512.324426408}
+objective_fn | 151184.962671616
+frac_reassigned | 0
+num_iterations | 3
+silhouette | 0.872087020146542
+elbow | -1056.25028126836
+-[ RECORD 2 ]----+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ [...]
+k | 3
+centroids | {{13.87,2.12666666666667,2.57666666666667,16.9333333333333,106,2.63333333333333,2.75666666666667,0.303333333333333,2.01333333333333,5.32663333333333,1.03666666666667,3.44333333333333,1256.66666666667},{14.285,1.855,2.475,16,112.5,3.56,3.44,0.29,2.075,7.275,0.955,3.15,1465},{13.872,1.814,2.376,15.56,88.2,2.806,2.928,0.288,1.844,5.35198,1.044,3.348,988}}
+cluster_variance | {8078.22646267333,452.7633,90512.324426408}
+objective_fn | 99043.3141890814
+frac_reassigned | 0
+num_iterations | 2
+silhouette | 0.895849874221733
+elbow | 20549.7753189989
+-[ RECORD 3 ]----+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ [...]
+k | 4
+centroids | {{14.02,1.765,2.385,16.05,105.75,2.845,3.1075,0.2725,2.2325,5.93495,1.04,3.3725,1085},{13.24,2.59,2.87,21,118,2.8,2.69,0.39,1.82,4.32,1.04,2.93,735},{14.255,1.9325,2.5025,16.05,110.5,3.055,2.9775,0.2975,1.845,6.2125,0.9975,3.365,1378.75},{13.2,1.78,2.14,11.2,1,2.65,2.76,0.26,1.28,4.38,1.05,3.49,1050}}
+cluster_variance | {14227.41709401,0,30561.74805,0}
+objective_fn | 44789.16514401
+frac_reassigned | 0
+num_iterations | 3
+silhouette | 0.877839150000438
+elbow | 17535.7421610686
+-[ RECORD 4 ]----+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ [...]
+k | 5
+centroids | {{14.285,1.855,2.475,16,112.5,3.56,3.44,0.29,2.075,7.275,0.955,3.15,1465},{14.225,2.01,2.53,16.1,108.5,2.55,2.515,0.305,1.615,5.15,1.04,3.58,1292.5},{13.2,1.78,2.14,11.2,1,2.65,2.76,0.26,1.28,4.38,1.05,3.49,1050},{14.04,1.8225,2.435,16.65,110,2.845,2.97,0.295,1.985,5.594975,1.0425,3.3125,972.5},{13.16,2.36,2.67,18.6,101,2.8,3.24,0.3,2.81,5.6799,1.03,3.17,1185}}
+cluster_variance | {452.7633,329.8988,0,76176.4564000075,0}
+objective_fn | 76959.1185000075
+frac_reassigned | 0
+num_iterations | 2
+silhouette | 0.771207558995578
+elbow | -28690.3421973961
+-[ RECORD 5 ]----+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ [...]
+k | 6
+centroids | {{14.23,1.71,2.43,15.6,127,2.8,3.06,0.28,2.29,5.64,1.04,3.92,1065},{13.24,2.59,2.87,21,118,2.8,2.69,0.39,1.82,4.32,1.04,2.93,735},{14.285,1.855,2.475,16,112.5,3.56,3.44,0.29,2.075,7.275,0.955,3.15,1465},{13.87,2.12666666666667,2.57666666666667,16.9333333333333,106,2.63333333333333,2.75666666666667,0.303333333333333,2.01333333333333,5.32663333333333,1.03666666666667,3.44333333333333,1256.66666666667},{14.345,1.495,2.22,15,97.5,2.89,3.065,0.255,1.915,6.20995,1.045,3.2,10 [...]
+cluster_variance | {0,0,452.7633,8078.22646267333,5.346498005,0}
+objective_fn | 8536.33626067833
+frac_reassigned | 0
+num_iterations | 3
+silhouette | 0.954496117675027
+elbow | -50296.3677976633
+</pre>
+-# Simplified silhouette per point. Let's say we want the simplified silhouette
+coefficient for each point in the data set, for the case where <em>k=3</em>:
+<pre class="example">
+DROP TABLE IF EXISTS km_points_silh;
+SELECT * FROM madlib.simple_silhouette_points( 'km_sample', -- Input points table
+ 'km_points_silh', -- Output table
+ 'pid', -- Point ID column in input table
+ 'points', -- Points column in input table
+ (SELECT centroids FROM km_result_auto WHERE k=3), -- centroids array
+ 'madlib.squared_dist_norm2' -- Distance function
+ );
+SELECT * FROM km_points_silh ORDER BY pid;
+</pre>
+<pre class="result">
+ pid | centroid_id | neighbor_centroid_id | silh
+-----+-------------+----------------------+-------------------
+ 1 | 2 | 0 | 0.800019825058391
+ 2 | 2 | 0 | 0.786712987562913
+ 3 | 0 | 2 | 0.867496080386644
+ 4 | 1 | 0 | 0.995466498952947
+ 5 | 2 | 0 | 0.761551610381542
+ 6 | 1 | 0 | 0.993950286967157
+ 7 | 0 | 1 | 0.960621637528528
+ 8 | 0 | 1 | 0.941493577405087
+ 9 | 2 | 0 | 0.925822020308802
+ 10 | 2 | 0 | 0.92536421766532
+(10 rows)
</pre>
-\b Arguments
-<dl class="arglist">
-<dt>rel_source</dt>
-<dd>TEXT. The name of the relation containing the input point.</dd>
-<dt>expr_point</dt>
-<dd>TEXT. An expression evaluating to point coordinates for each row in the relation.</dd>
-<dt>centroids</dt>
-<dd>TEXT. An expression evaluating to an array of centroids. </dd>
-<dt>fn_dist (optional)</dt>
-<dd>TEXT, default 'dist_norm2', The name of a function to calculate the distance of a point from a centroid. See the \e fn_dist argument of the k-means training function.</dd>
-</dl>
-
@anchor background
@par Technical Background
+<b>k-means Algorithm</b>
+<br>
Formally, we wish to minimize the following objective function:
\f[
(c_1, \dots, c_k) \mapsto \sum_{i=1}^n \min_{j=1}^k \operatorname{dist}(x_i, c_j)
\f]
In the most common case, \f$ \operatorname{dist} \f$ is the square of the
-Euclidean distance.
+Euclidean distance, though other distance measures can be used.
This problem is computationally difficult (NP-hard), yet the
-local-search heuristic proposed by Lloyd [4] performs reasonably well in
+local-search heuristic proposed by Lloyd <a href="#kmeans-lit-4">[4]</a> performs reasonably well in
practice. In fact, it is so ubiquitous today that it is
often referred to as the <em>standard algorithm</em> or even just the
-<em>k-means algorithm</em> [1]. It works as follows:
+<em>k-means algorithm</em>. It works as follows:
--# Seed the \f$ k \f$ centroids (see below)
+-# Seed the \f$ k \f$ centroids, meaning specify their initial positions (see below).
-# Repeat until convergence:
- -# Assign each point to its closest centroid
+ -# Assign each point to its closest centroid.
-# Move each centroid to a position that minimizes the sum of distances in this
- cluster
+ cluster.
-# Convergence is achieved when no points change their assignments during step
2a.
Since the objective function decreases in every step, this algorithm is
guaranteed to converge to a local optimum.
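The steps above can be sketched in a few lines of NumPy. This is an illustrative Python sketch using squared Euclidean distance, not the MADlib implementation; `lloyd_kmeans` is a hypothetical helper name:

```python
import numpy as np

def lloyd_kmeans(points, centroids, max_iter=100):
    """Lloyd's algorithm with squared Euclidean distance.
    points: (n, d) array; centroids: (k, d) array of initial seeds."""
    centroids = centroids.astype(float).copy()
    assign = np.full(len(points), -1)
    for _ in range(max_iter):
        # Step 2a: assign each point to its closest centroid.
        d = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        new_assign = d.argmin(axis=1)
        if np.array_equal(new_assign, assign):  # no reassignments -> converged
            break
        assign = new_assign
        # Step 2b: move each centroid to the mean of its cluster, which
        # minimizes the sum of squared distances within the cluster.
        for j in range(len(centroids)):
            members = points[assign == j]
            if len(members) > 0:
                centroids[j] = members.mean(axis=0)
    # Objective: sum over points of squared distance to the closest centroid.
    d = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    objective = d[np.arange(len(points)), assign].sum()
    return centroids, assign, objective
```

Each iteration can only decrease the objective, which is why the loop always terminates at a local optimum.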
+The quality of k-means is highly influenced by the choice of seeding.
+MADlib supports uniform-at-random sampling, k-means++ seeding, and
+manual seeding for users who want to supply centroids computed by other methods.
+
+k-means++ seeding <a href="#kmeans-lit-1">[1]</a> starts with a single centroid chosen uniformly at random from the input points.
+It then iteratively chooses new centroids from the input points until there are k centroids in total.
+The probability of picking a particular point is proportional to its minimum distance to any existing
+centroid. Intuitively, k-means++ favors seedings in which the centroids are spread out over the whole
+range of the input points, while at the same time not being too susceptible to outliers.
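The sampling scheme can be sketched as follows. This is a minimal NumPy illustration assuming squared Euclidean distance (MADlib weights by whatever distance function is configured); `kmeanspp_seed` is a hypothetical helper name:

```python
import numpy as np

def kmeanspp_seed(points, k, seed=None):
    """k-means++ seeding: each new centroid is drawn from the input points
    with probability proportional to its minimum (squared Euclidean)
    distance to the centroids chosen so far."""
    rng = np.random.default_rng(seed)
    # First centroid: chosen uniformly at random among the input points.
    centroids = [points[rng.integers(len(points))]]
    while len(centroids) < k:
        # Minimum squared distance of every point to the existing centroids.
        d2 = ((points[:, None, :] - np.array(centroids)[None, :, :]) ** 2) \
                 .sum(axis=2).min(axis=1)
        # Sampling weight proportional to that distance; already-chosen
        # points have weight 0 and cannot be picked again.
        centroids.append(points[rng.choice(len(points), p=d2 / d2.sum())])
    return np.array(centroids)
```

Because a point's weight is its distance to the nearest existing centroid, far-away regions are likely to receive a seed, while a single extreme outlier only wins with probability proportional to its (one) distance.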
+
+@anchor simplified_silhouette
+<b>Silhouette</b>
+<br>
+
+To evaluate the validity of clusterings with different values of k,
+the objective function is not an ideal measure: it decreases
+as k increases, so it is biased
+toward selecting the largest k as the best result <a href="#kmeans-lit-6">[6]</a>.
+We therefore use other internal measures to evaluate cluster validity.
+
+One such measure is the silhouette score. The original formulation <a href="#kmeans-lit-7">[7]</a>
+computes, for each data point, the average distance
+to all other data points in its own cluster, and to all data points
+in the neighboring cluster nearest to that point.
+This is expensive for a large number of points,
+since it requires the full pairwise distance matrix over all data.
+
+In the simplified silhouette <a href="#kmeans-lit-3">[3]</a> which is used in MADlib,
+the distance of a data point to a cluster
+is represented with the distance to the cluster centroid instead of the
+average distance to all (other) data points in the cluster.
+This has the advantage of being much faster to compute, and can be shown
+to be comparable in performance to the full silhouette <a href="#kmeans-lit-8">[8]</a>.
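The per-point computation can be sketched as below. This is an illustrative NumPy sketch assuming Euclidean distance, mirroring the `centroid_id`, `neighbor_centroid_id`, and `silh` columns of `simple_silhouette_points`; it is not the MADlib implementation:

```python
import numpy as np

def simple_silhouette_points(points, centroids):
    """Simplified silhouette per point: a = distance to the closest
    centroid (own cluster), b = distance to the second-closest centroid
    (nearest neighboring cluster), silh = (b - a) / max(a, b)."""
    d = np.sqrt(((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2))
    order = d.argsort(axis=1)
    own, neighbor = order[:, 0], order[:, 1]   # closest and 2nd-closest centroid
    rows = np.arange(len(points))
    a, b = d[rows, own], d[rows, neighbor]
    denom = np.maximum(a, b)
    # Guard against a point that coincides with two centroids (denom == 0).
    silh = np.divide(b - a, denom, out=np.zeros(len(points)), where=denom > 0)
    return own, neighbor, silh
```

The score lies in [-1, 1]: values near 1 mean the point is much closer to its own centroid than to any other, and the average over all points gives the per-k `silhouette` value reported above.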
+
+@anchor elbow
+<b>Elbow Method</b>
+<br>
+The elbow method considers the percentage of variance explained as a function of the number of clusters:
+the idea is not to add another cluster if it does not model the data better.
+Graphically, this means identifying the "elbow" in the curve of the sum of squared errors
+vs. the number of clusters k. This heuristic was possibly first suggested in <a href="#kmeans-lit-9">[9]</a>.
+To locate the elbow, we compute the 2nd derivative of the
+objective function using the NumPy gradient function <a href="#kmeans-lit-2">[2]</a>.
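The idea can be sketched as picking the k at which the objective-vs-k curve bends most sharply, i.e. where its discrete 2nd derivative is largest. This is an illustrative sketch of the technique; the exact formula behind the `elbow` column in MADlib's output may differ, and `pick_elbow` is a hypothetical helper name:

```python
import numpy as np

def pick_elbow(ks, objective):
    """Locate the 'elbow': the k where the objective-vs-k curve flattens
    most abruptly, i.e. where its 2nd derivative (computed twice with
    numpy.gradient) is largest."""
    ks = np.asarray(ks, dtype=float)
    obj = np.asarray(objective, dtype=float)
    first = np.gradient(obj, ks)       # slope of the SSE curve
    second = np.gradient(first, ks)    # curvature: large where slope jumps up
    return int(ks[np.argmax(second)])
```

A steeply falling objective that suddenly levels off produces a large positive 2nd derivative at the level-off point, which is exactly the visual "elbow" of the curve.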
+
@anchor literature
@literature
@anchor kmeans-lit-1
-[1] Wikipedia, K-means Clustering,
- http://en.wikipedia.org/wiki/K-means_clustering
+[1] David Arthur, Sergei Vassilvitskii: k-means++: the advantages of careful
+ seeding, Proceedings of the 18th Annual ACM-SIAM Symposium on Discrete
+ Algorithms (SODA'07), pp. 1027-1035.
@anchor kmeans-lit-2
-[2] David Arthur, Sergei Vassilvitskii: k-means++: the advantages of careful
- seeding, Proceedings of the 18th Annual ACM-SIAM Symposium on Discrete
- Algorithms (SODA'07), pp. 1027-1035,
- http://www.stanford.edu/~darthur/kMeansPlusPlus.pdf
+[2] NumPy gradient function, https://docs.scipy.org/doc/numpy/reference/generated/numpy.gradient.html
@anchor kmeans-lit-3
[3] E. R. Hruschka, L. N. C. Silva, R. J. G. B. Campello: Clustering
@@ -486,6 +902,22 @@ guaranteed to converge to a local optimum.
[5] Leisch, Friedrich: A Toolbox for K-Centroids Cluster Analysis. In: Computational
Statistics and Data Analysis, 51(2). pp. 526-544. 2006.
+@anchor kmeans-lit-6
+[6] Jain, A.K.: Data clustering: 50 years beyond k-means. Pattern Recognition
+Letters 31(8), 651-666 (2010)
+
+@anchor kmeans-lit-7
+[7] Peter J. Rousseeuw: Silhouettes: a Graphical Aid to the Interpretation
+and Validation of Cluster Analysis. Journal of Computational and Applied Mathematics 20 (1987): 53-65
+
+@anchor kmeans-lit-8
+[8] Wang F., Franco-Penya HH., Kelleher J.D., Pugh J., Ross R. (2017) An Analysis
+of the Application of Simplified Silhouette to the Evaluation of k-means Clustering
+Validity. In: Perner P. (eds) Machine Learning and Data Mining in Pattern Recognition.
+MLDM 2017. Lecture Notes in Computer Science, vol 10358. Springer, Cham
+
+@anchor kmeans-lit-9
+[9] Robert L. Thorndike (December 1953). "Who Belongs in the Family?". Psychometrika. 18 (4): 267–276. doi:10.1007/BF02289263.
@anchor related
@par Related Topics
@@ -494,8 +926,9 @@ File kmeans.sql_in documenting the k-Means SQL functions
\ref grp_svec
-\ref simple_silhouette()
+closest_column()
+array_unnest_2d_to_1d()
@internal
@sa namespace kmeans (documenting the implementation in Python)