Posted to commits@lucene.apache.org by jb...@apache.org on 2019/12/14 21:02:16 UTC
[lucene-solr] branch visual-guide updated: Visual Guide: Improve Kmeans docs

This is an automated email from the ASF dual-hosted git repository.

jbernste pushed a commit to branch visual-guide
in repository https://gitbox.apache.org/repos/asf/lucene-solr.git


The following commit(s) were added to refs/heads/visual-guide by this push:
     new 3abe952  Visual Guide: Improve Kmeans docs
3abe952 is described below

commit 3abe9520d4df9583018515eddc9e4dda274ff1e0
Author: Joel Bernstein <jb...@apache.org>
AuthorDate: Sat Dec 14 16:01:38 2019 -0500

    Visual Guide: Improve Kmeans docs
---
 .../src/images/math-expressions/2DCluster1.png     | Bin 533546 -> 449150 bytes
 solr/solr-ref-guide/src/machine-learning.adoc      |  23 ++++++++++++++++-----
 2 files changed, 18 insertions(+), 5 deletions(-)

diff --git a/solr/solr-ref-guide/src/images/math-expressions/2DCluster1.png b/solr/solr-ref-guide/src/images/math-expressions/2DCluster1.png
index 08daacf..9feea22 100644
Binary files a/solr/solr-ref-guide/src/images/math-expressions/2DCluster1.png and b/solr/solr-ref-guide/src/images/math-expressions/2DCluster1.png differ
diff --git a/solr/solr-ref-guide/src/machine-learning.adoc b/solr/solr-ref-guide/src/machine-learning.adoc
index 09b626a..8158a26 100644
--- a/solr/solr-ref-guide/src/machine-learning.adoc
+++ b/solr/solr-ref-guide/src/machine-learning.adoc
@@ -432,7 +432,16 @@ for examining and visualizing the clusters and centroids.
 
 ==== Clustered Scatter Plot
 
-In this example the `random` function draws a sample of records from the nyc311 (complaints database) collection where
+In this example we'll again be clustering 2D lat/lon points of rat sightings. Unlike DBSCAN, however, K-Means clustering
+does not, on its own,
+perform any noise reduction. To reduce the noise, a smaller random sample is selected from the data than was used
+for the DBSCAN example.
+
+We'll see that sampling is itself a powerful noise reduction tool that helps visualize cluster density.
+This is because samples are more likely to be drawn from higher density clusters and less
+likely to be drawn from lower density clusters.
+
+In this example the `random` function draws a sample of 1500 records from the nyc311 (complaints database) collection where
 the complaint description matches "rat sighting" and latitude is populated in the record. The latitude and longitude fields
 are then vectorized and added as rows to a matrix. The matrix is transposed so each row contains a single latitude, longitude
 point. The `kmeans` function is then used to cluster the latitude and longitude points into 21 clusters.
@@ -443,10 +452,14 @@ image::images/math-expressions/2DCluster1.png[]
 
 The scatter plot above shows each lat/lon point plotted on a Euclidean plain with longitude on the
 *x* axis and
-latitude on *y* axis. Each cluster is shown in a different color. This plot provides interesting
-insight into the clusters of rat sightings throughout the five boroughs of New York City. For
-example it highlights a cluster of dense sightings in Brooklyn at cluster5 and cluster17,
-surrounded by less dense clusters.
+latitude on the *y* axis. The plot is dense enough that the outlines of the different boroughs are visible
+to anyone familiar with the geography of New York City.
+
+
+Each cluster is shown in a different color. This plot provides interesting
+insight into the densities of rat sightings throughout the five boroughs of New York City. For
+example, it highlights a cluster of dense sightings in Brooklyn at cluster1,
+surrounded by less dense but still high-activity clusters.
 
 ==== Plotting the Centroids