Posted to commits@lucene.apache.org by jb...@apache.org on 2019/10/02 12:57:25 UTC
[lucene-solr] branch SOLR-13105-visual updated: SOLR-13105: Update machine learning docs 10
This is an automated email from the ASF dual-hosted git repository.
jbernste pushed a commit to branch SOLR-13105-visual
in repository https://gitbox.apache.org/repos/asf/lucene-solr.git
The following commit(s) were added to refs/heads/SOLR-13105-visual by this push:
new 4481e5b SOLR-13105: Update machine learning docs 10
4481e5b is described below
commit 4481e5ba9f94f1182cc228653d722bc09689a7df
Author: Joel Bernstein <jb...@apache.org>
AuthorDate: Wed Oct 2 08:57:17 2019 -0400
SOLR-13105: Update machine learning docs 10
---
solr/solr-ref-guide/src/machine-learning.adoc | 98 ++++++++++++++++-----------
1 file changed, 60 insertions(+), 38 deletions(-)
diff --git a/solr/solr-ref-guide/src/machine-learning.adoc b/solr/solr-ref-guide/src/machine-learning.adoc
index ad76eba..b107391 100644
--- a/solr/solr-ref-guide/src/machine-learning.adoc
+++ b/solr/solr-ref-guide/src/machine-learning.adoc
@@ -633,7 +633,7 @@ that the features are all bigram phrases with semantic significance to the resul
let(a=select(search(reviews, q="review_t:\"star wars\"", rows="500"),
id,
analyze(review_t, text_bigrams) as terms),
- vectors=termVectors(a, maxDocFreq=.10, minDocFreq=.03, minTermLength=13, exclude="_,br,have,',what"),
+ vectors=termVectors(a, maxDocFreq=.10, minDocFreq=.03, minTermLength=13, exclude="_,br,have"),
clusters=kmeans(vectors, 5),
centroids=getCentroids(clusters),
phrases=topFeatures(centroids, 5))
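As background for the `kmeans` call above: k-means repeatedly assigns each vector to its nearest centroid and recomputes each centroid as the mean of its members. A minimal pure-Python sketch of that loop (the toy 2-D vectors and function name are illustrative only, not part of Solr's API):

```python
import math
import random

def kmeans(vectors, k, iterations=20, seed=42):
    """Plain Lloyd's algorithm: assign each vector to its nearest
    centroid, then recompute each centroid as its cluster mean."""
    rng = random.Random(seed)
    centroids = [list(v) for v in rng.sample(vectors, k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for v in vectors:
            dists = [math.dist(v, c) for c in centroids]
            clusters[dists.index(min(dists))].append(v)
        for i, members in enumerate(clusters):
            if members:  # keep the old centroid if a cluster empties
                centroids[i] = [sum(col) / len(members) for col in zip(*members)]
    return clusters, centroids

# Two well-separated groups standing in for term vectors
vectors = [(0.1, 0.9), (0.2, 0.8), (0.15, 0.85),
           (0.9, 0.1), (0.8, 0.2), (0.85, 0.15)]
clusters, centroids = kmeans(vectors, 2)
```

Solr's `kmeans` works the same way over the term-vector matrix, with each document's TF-IDF vector playing the role of the toy points above.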
@@ -713,7 +713,7 @@ rather than a single trial of the `kmeans` function.
let(a=select(search(reviews, q="review_t:\"star wars\"", rows="500"),
id,
analyze(review_t, text_bigrams) as terms),
- vectors=termVectors(a, maxDocFreq=.10, minDocFreq=.03, minTermLength=13, exclude="_,br,have,',what"),
+ vectors=termVectors(a, maxDocFreq=.10, minDocFreq=.03, minTermLength=13, exclude="_,br,have"),
clusters=multiKmeans(vectors, 5, 15),
centroids=getCentroids(clusters),
phrases=topFeatures(centroids, 5))
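The `multiKmeans` function runs k-means several times from different random starting points and keeps the best run. A hedged sketch of that idea in plain Python, scoring each trial by its total intra-cluster squared distance (inertia); the helper names and data are illustrative, not Solr internals:

```python
import math
import random

def kmeans_once(vectors, k, rng, iterations=20):
    """One k-means trial from a random start; returns (inertia, centroids)."""
    centroids = [list(v) for v in rng.sample(vectors, k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for v in vectors:
            dists = [math.dist(v, c) for c in centroids]
            clusters[dists.index(min(dists))].append(v)
        for i, members in enumerate(clusters):
            if members:  # keep the old centroid if a cluster empties
                centroids[i] = [sum(col) / len(members) for col in zip(*members)]
    # Inertia: total squared distance of each vector to its nearest centroid
    inertia = sum(min(math.dist(v, c) ** 2 for c in centroids) for v in vectors)
    return inertia, centroids

def multi_kmeans(vectors, k, trials, seed=7):
    """Run several randomly seeded trials; keep the lowest-inertia result."""
    rng = random.Random(seed)
    return min((kmeans_once(vectors, k, rng) for _ in range(trials)),
               key=lambda trial: trial[0])

vectors = [(0.0, 0.0), (0.1, 0.1), (1.0, 1.0),
           (1.1, 0.9), (5.0, 5.0), (5.1, 4.9)]
best_inertia, best_centroids = multi_kmeans(vectors, 3, trials=15)
```

Because a single k-means trial can get stuck in a poor local optimum, taking the best of many trials makes the clustering much more robust, which is the motivation for `multiKmeans(vectors, 5, 15)` above.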
@@ -781,34 +781,33 @@ is a value between 1 and 2 that determines how fuzzy to make the cluster assignm
After the clustering has been performed the `getMembershipMatrix` function can be called
on the clustering result to return a matrix describing which clusters each vector belongs to.
-There is a row in the matrix for each vector that was clustered. There is a column in the matrix
-for each cluster. The values in the columns are the probability that the vector belonged to the specific
-cluster.
+This matrix can be used to understand relationships between clusters.
-A simple example will make this more clear. In the example below 300 documents are analyzed and
-then turned into a term vector matrix. Then the `fuzzyKmeans` function clusters the
-term vectors into 12 clusters with a fuzziness factor of 1.25.
+In the example below `fuzzyKmeans` is used to cluster the movie reviews matching the phrase "star wars".
+But instead of looking at the clusters or centroids, the `getMembershipMatrix` function is called to return the
+membership probabilities for each document. The membership matrix contains a row for each
+vector that was clustered and a column for each cluster.
+The values in the matrix are the probability that the vector belongs to a specific cluster.
+
+The `corr` function is then used to compute a *correlation matrix* of the columns of the
+membership matrix. In other words, the correlation matrix shows how strongly the clusters are
+correlated, based on which documents co-occur in them.
+
+Notice that in the example cluster3 and cluster5 are very highly correlated, which means that
+many documents had a high probability of belonging to both clusters. Further analysis of the key
+features in both clusters can be done to understand how these clusters are interconnected.
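The correlation step can be sketched in plain Python: compute the Pearson correlation for every pair of membership-matrix columns. The toy membership matrix below is illustrative (4 documents, 3 clusters), not data from the example:

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def column_correlations(matrix):
    """Correlation matrix of the columns of a row-major matrix."""
    cols = list(zip(*matrix))
    return [[pearson(c1, c2) for c2 in cols] for c1 in cols]

# Toy membership matrix: each row is a document, each column a cluster,
# values are cluster-membership probabilities.
membership = [
    [0.8, 0.1, 0.1],
    [0.7, 0.2, 0.1],
    [0.1, 0.5, 0.4],
    [0.2, 0.4, 0.4],
]
corr = column_correlations(membership)
```

The diagonal is always 1 (each column correlates perfectly with itself), and a high off-diagonal value flags two clusters whose membership probabilities rise and fall together across documents, which is exactly the cluster3/cluster5 pattern discussed above.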
[source,text]
----
-let(a=select(random(collection3, q="body:oil", rows="300", fl="id, body"),
- id,
- analyze(body, body_bigram) as terms),
- b=termVectors(a, maxDocFreq=.09, minDocFreq=.03, minTermLength=14, exclude="_,copyright"),
- c=fuzzyKmeans(b, 12, fuzziness=1.25),
- d=getMembershipMatrix(c), <1>
- e=rowAt(d, 0), <2>
- f=precision(e, 5)) <3>
+let(a=select(search(reviews, q="review_t:\"star wars\"", rows="500"),
+ id,
+ analyze(review_t, text_bigrams) as terms),
+ vectors=termVectors(a, maxDocFreq=.10, minDocFreq=.03, minTermLength=13, exclude="_,br,have"),
+ clusters=fuzzyKmeans(vectors, 5, fuzziness=1.3),
+ m=getMembershipMatrix(clusters),
+ corr=corr(m))
----
-<1> The `getMembershipMatrix` function is used to return the membership matrix;
-<2> and the first row of membership matrix is retrieved with the `rowAt` function.
-<3> The `precision` function is then applied to the first row
-of the matrix to make it easier to read.
-
-This expression returns a single vector representing the cluster membership probabilities for the first
-term vector. Notice that the term vector has the highest association with the 12^th^ cluster,
-but also has significant associations with the 3^rd^, 5^th^, 6^th^ and 7^th^ clusters:
[source,json]
----
@@ -816,24 +815,47 @@ but also has significant associations with the 3^rd^, 5^th^, 6^th^ and 7^th^ clu
"result-set": {
"docs": [
{
- "f": [
- 0,
- 0,
- 0.178,
- 0,
- 0.17707,
- 0.17775,
- 0.16214,
- 0,
- 0,
- 0,
- 0,
- 0.30504
+ "corr": [
+ [
+ 1,
+ -0.3107483649904961,
+ -0.01238925922725737,
+ -0.034546141301127015,
+ -0.012389261961639414
+ ],
+ [
+ -0.3107483649904961,
+ 1,
+ -0.7752380698457411,
+ -0.49268725855405776,
+ -0.7752380691584819
+ ],
+ [
+ -0.01238925922725737,
+ -0.7752380698457411,
+ 1,
+ -0.0508166330303757,
+ 0.9999999999999954
+ ],
+ [
+ -0.034546141301127015,
+ -0.49268725855405776,
+ -0.0508166330303757,
+ 1,
+ -0.05081663258795273
+ ],
+ [
+ -0.012389261961639414,
+ -0.7752380691584819,
+ 0.9999999999999954,
+ -0.05081663258795273,
+ 1
+ ]
]
},
{
"EOF": true,
- "RESPONSE_TIME": 2157
+ "RESPONSE_TIME": 245
}
]
}