Posted to commits@lucene.apache.org by jb...@apache.org on 2019/10/11 14:25:40 UTC

[lucene-solr] branch SOLR-13105-visual updated: SOLR-13105: Improve ML docs 20

This is an automated email from the ASF dual-hosted git repository.

jbernste pushed a commit to branch SOLR-13105-visual
in repository https://gitbox.apache.org/repos/asf/lucene-solr.git


The following commit(s) were added to refs/heads/SOLR-13105-visual by this push:
     new 3caa09d  SOLR-13105: Improve ML docs 20
3caa09d is described below

commit 3caa09de231e0b65010718265732cf84cb909460
Author: Joel Bernstein <jb...@apache.org>
AuthorDate: Fri Oct 11 10:25:12 2019 -0400

    SOLR-13105: Improve ML docs 20
---
 .../src/images/math-expressions/redwine1.png       | Bin 0 -> 294849 bytes
 .../src/images/math-expressions/redwine2.png       | Bin 0 -> 269037 bytes
 .../src/images/math-expressions/sined.png          | Bin 248623 -> 268677 bytes
 solr/solr-ref-guide/src/machine-learning.adoc      | 224 ++++++++-------------
 4 files changed, 82 insertions(+), 142 deletions(-)

diff --git a/solr/solr-ref-guide/src/images/math-expressions/redwine1.png b/solr/solr-ref-guide/src/images/math-expressions/redwine1.png
new file mode 100644
index 0000000..2b7074a
Binary files /dev/null and b/solr/solr-ref-guide/src/images/math-expressions/redwine1.png differ
diff --git a/solr/solr-ref-guide/src/images/math-expressions/redwine2.png b/solr/solr-ref-guide/src/images/math-expressions/redwine2.png
new file mode 100644
index 0000000..c876955
Binary files /dev/null and b/solr/solr-ref-guide/src/images/math-expressions/redwine2.png differ
diff --git a/solr/solr-ref-guide/src/images/math-expressions/sined.png b/solr/solr-ref-guide/src/images/math-expressions/sined.png
index a41d73d..9e99e09 100644
Binary files a/solr/solr-ref-guide/src/images/math-expressions/sined.png and b/solr/solr-ref-guide/src/images/math-expressions/sined.png differ
diff --git a/solr/solr-ref-guide/src/machine-learning.adoc b/solr/solr-ref-guide/src/machine-learning.adoc
index 0ce2be9..de2e425 100644
--- a/solr/solr-ref-guide/src/machine-learning.adoc
+++ b/solr/solr-ref-guide/src/machine-learning.adoc
@@ -134,7 +134,6 @@ decreases.
 image::images/math-expressions/distance.png[]
 
 
-
 == knnSearch
 
 The `knnSearch` function returns the k-nearest neighbors
@@ -256,180 +255,121 @@ of K (nearest neighbors), the smoother the line.
 
 === Multivariate Non-Linear Regression
 
-The `knnRegress` function prepares the training set for use with the `predict` function.
+The `knnRegress` function is also a powerful and flexible tool for
+multivariate non-linear regression.
 
-Below is an example of the `knnRegress` function. In this example 10,000 random samples
-are taken, each containing the variables `filesize_d`, `service_d` and `response_d`. The pairs of
-`filesize_d` and `service_d` will be used to predict the value of `response_d`.
+In the example below a multivariate regression is demonstrated using
+a database designed for analyzing and predicting wine quality. The
+database contains nearly 1600 records with 9 predictors of wine quality:
+pH, alcohol, fixed_acidity, sulphates, density, free_sulfur_dioxide,
+volatile_acidity, citric_acid, residual_sugar. There is also a field
+called quality, which is a ranking assigned to each wine ranging
+from 3 to 8.
 
+Using `knnRegress` we can predict wine quality for vectors containing
+the predictor values.
 
-[source,text]
-----
-let(a=random(logs, q="*:*", rows="500", fl="filesize_d,  load_d, eresponse_d"),
-     x=col(a, filesize_d),
-     y=col(a, load_d),
-     z=col(a, eresponse_d),
-     obs=transpose(matrix(x, y)),
-     r=knnRegress(obs, z , 20))
-----
+In the example a search is performed on the *redwine* collection to
+bring back all the rows in the database. Then the quality and each
+predictor field are read into vectors and assigned to variables.
 
-This expression returns the following response. Notice that `knnRegress` returns a tuple describing the regression inputs:
+The predictor variables are then added as rows to a matrix, which is
+transposed so that each row in the matrix contains one observation of
+the predictors. This is our observation matrix, which is assigned to the
+variable *obs*.
 
-[source,json]
-----
-{
-  "result-set": {
-    "docs": [
-      {
-        "lazyModel": {
-          "features": 2,
-          "robust": false,
-          "distance": "EuclideanDistance",
-          "observations": 10000,
-          "scale": false,
-          "k": 5
-        }
-      },
-      {
-        "EOF": true,
-        "RESPONSE_TIME": 170
-      }
-    ]
-  }
-}
-----
+Then the `knnRegress` function sets up the regression model for
+predicting quality based on the observation data. The value for K is
+set to 5 in the example, so the average quality of the 5 nearest
+neighbors will be used to calculate the predicted quality.
 
-=== Prediction and Residuals
+The `predict` function is then used to generate a vector of predictions
+for the entire observation set. These predictions will be used to determine
+how well the KNN regression performed over the training set.
 
-The output of `knnRegress` can be used with the `predict` function like other regression models.
+The errors, or *residuals*, for the regression are then calculated by
+subtracting the predicted quality from the observed quality.
+The `ebeSubtract` function is used to perform the element-by-element
+vector subtraction.
 
-In the example below the `predict` function is used to predict results for the original training
-data. The sumSq of the residuals is then calculated.
+Finally the `zplot` function formats the predictions and errors for
+visualization as a *residuals plot*.
 
-[source,text]
-----
-let(samples=random(collection1, q="*:*", rows="10000", fl="filesize_d, service_d, response_d"),
-    filesizes=col(samples, filesize_d),
-    serviceLevels=col(samples, service_d),
-    outcomes=col(samples, response_d),
-    observations=transpose(matrix(filesizes, serviceLevels)),
-    lazyModel=knnRegress(observations, outcomes , 5),
-    predictions=predict(lazyModel, observations),
-    residuals=ebeSubtract(outcomes, predictions),
-    sumSqErr=sumSq(residuals))
-----
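+A sketch of the full expression described above is shown below. The
+field names, `rows` value and `sort` field are illustrative and may
+differ from the actual *redwine* schema:
+
+[source,text]
+----
+let(data=search(redwine, q="*:*", rows="2000", sort="quality asc",
+                fl="quality, pH, alcohol, fixed_acidity, sulphates, density, free_sulfur_dioxide, volatile_acidity, citric_acid, residual_sugar"),
+    quality=col(data, quality),
+    ph=col(data, pH),
+    alcohol=col(data, alcohol),
+    fixedAcidity=col(data, fixed_acidity),
+    sulphates=col(data, sulphates),
+    density=col(data, density),
+    freeSulfur=col(data, free_sulfur_dioxide),
+    volatileAcidity=col(data, volatile_acidity),
+    citricAcid=col(data, citric_acid),
+    residualSugar=col(data, residual_sugar),
+    obs=transpose(matrix(ph, alcohol, fixedAcidity, sulphates, density,
+                         freeSulfur, volatileAcidity, citricAcid, residualSugar)),
+    model=knnRegress(obs, quality, 5),
+    predictions=predict(model, obs),
+    residuals=ebeSubtract(quality, predictions),
+    zplot(x=predictions, y=residuals))
+----
+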
+image::images/math-expressions/redwine1.png[]
 
-This expression returns the following response:
+The residuals plot shows the prediction values on the *x* axis and the error for the
+prediction on the *y* axis. The scatter plot shows how the errors are
+distributed across the range of predictions.
 
-[source,json]
-----
-{
-  "result-set": {
-    "docs": [
-      {
-        "sumSqErr": 1920290.1204126712
-      },
-      {
-        "EOF": true,
-        "RESPONSE_TIME": 3796
-      }
-    ]
-  }
-}
-----
+The residuals plot can be interpreted to understand how the KNN regression performed on the
+training data.
 
-=== Setting Feature Scaling
+* The plot shows that the prediction errors appear to be fairly evenly distributed
+above and below zero. The density of the errors increases as they approach zero. Larger
+bubble sizes reflect higher counts of errors at that point in the plot. This provides
+an intuitive feel for the model's error.
 
-If the features in the observation matrix are not in the same scale then the larger features
-will carry more weight in the distance calculation then the smaller features. This can greatly
-impact the accuracy of the prediction. The `knnRegress` function has a `scale` parameter which
-can be set to `true` to automatically scale the features in the same range.
+* The plot also shows the variance of the error at the different levels of
+prediction. This provides an understanding of how well KNN regression is working
+for the entire range of predictions.
 
-The example below shows `knnRegress` with feature scaling turned on.
+The residuals can also be visualized using a histogram to better understand
+the shape of the residuals distribution. The example below shows the same KNN
+regression as above but with a plot of the distribution of the errors.
 
-Notice that when feature scaling is turned on the `sumSqErr` in the output is much lower.
-This shows how much more accurate the predictions are when feature scaling is turned on in
-this particular example. This is because the `filesize_d` feature is significantly larger then
-the `service_d` feature.
+In the example the `zplot` function is used to plot the `empiricalDistribution`
+of the residuals as an 11-bin histogram.
 
-[source,text]
-----
-let(samples=random(collection1, q="*:*", rows="10000", fl="filesize_d, service_d, response_d"),
-    filesizes=col(samples, filesize_d),
-    serviceLevels=col(samples, service_d),
-    outcomes=col(samples, response_d),
-    observations=transpose(matrix(filesizes, serviceLevels)),
-    lazyModel=knnRegress(observations, outcomes , 5, scale=true),
-    predictions=predict(lazyModel, observations),
-    residuals=ebeSubtract(outcomes, predictions),
-    sumSqErr=sumSq(residuals))
-----
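+Only the final plotting step changes from the sketch above, replacing
+the scatter plot call. Sample syntax:
+
+[source,text]
+----
+zplot(dist=empiricalDistribution(residuals, 11))
+----
+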
+image::images/math-expressions/redwine2.png[]
 
-This expression returns the following response:
+Notice that the errors follow a bell curve centered close to 0. From this plot
+we can see that the probability of getting prediction errors between -1 and 1 is quite high.
 
-[source,json]
-----
-{
-  "result-set": {
-    "docs": [
-      {
-        "sumSqErr": 4076.794951120683
-      },
-      {
-        "EOF": true,
-        "RESPONSE_TIME": 3790
-      }
-    ]
-  }
-}
-----
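+Once the model is set up, the quality of a new wine can be predicted
+by passing a vector of the nine predictor values to the `predict`
+function. The values below are hypothetical and must be in the same
+column order as the observation matrix. Sample syntax:
+
+[source,text]
+----
+newWine=array(3.3, 9.5, 7.8, 0.65, 0.996, 15, 0.6, 0.1, 2.1),
+predictedQuality=predict(model, newWine),
+----
+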
+*Additional KNN Regression Parameters*
+
+The `knnRegress` function has three additional parameters that make it suitable for many
+different regression scenarios.
 
+1) Any of the distance measures can be used for the regression simply by adding the function
+to the call. This allows for regression analysis over sparse vectors (cosine), dense vectors and
+geo-spatial lat/lon vectors (haversineMeters).
 
-=== Setting Robust Regression
+Sample syntax:
 
-The default prediction approach is to take the mean of the outcomes of the k-nearest
-neighbors. If the outcomes contain outliers the mean value can be skewed. Setting
-the `robust` parameter to `true` will take the median outcome of the k-nearest neighbors.
-This provides a regression prediction that is robust to outliers.
+[source,text]
+----
+r=knnRegress(obs, quality, 5, cosine()),
+----
 
-=== Setting the Distance Measure
+2) The `robust` named parameter can be used to perform a regression analysis that is robust
+to outliers in the outcomes. When the `robust` named parameter is used, the median outcome
+of the K nearest neighbors is used rather than the average.
 
-The distance measure can be changed for the k-nearest neighbor search by adding a distance measure
-function to the `knnRegress` parameters. Below is an example using `manhattan` distance.
+Sample syntax:
 
 [source,text]
 ----
-let(samples=random(collection1, q="*:*", rows="10000", fl="filesize_d, service_d, response_d"),
-    filesizes=col(samples, filesize_d),
-    serviceLevels=col(samples, service_d),
-    outcomes=col(samples, response_d),
-    observations=transpose(matrix(filesizes, serviceLevels)),
-    lazyModel=knnRegress(observations, outcomes, 5, manhattan(), scale=true),
-    predictions=predict(lazyModel, observations),
-    residuals=ebeSubtract(outcomes, predictions),
-    sumSqErr=sumSq(residuals))
+r=knnRegress(obs, quality, 5, robust="true"),
 ----
 
-This expression returns the following response:
+3) The `scale` named parameter can be used to scale the columns of the observations and search vectors
+at prediction time. This can improve the performance of the KNN regression when the feature columns
+are on such different scales that the distance calculations are over-weighted toward specific columns.
 
-[source,json]
+Sample syntax:
+
+[source,text]
 ----
-{
-  "result-set": {
-    "docs": [
-      {
-        "sumSqErr": 4761.221942288098
-      },
-      {
-        "EOF": true,
-        "RESPONSE_TIME": 3571
-      }
-    ]
-  }
-}
+r=knnRegress(obs, quality, 5, scale="true"),
 ----
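+
+The distance measure and the named parameters can also be combined in
+a single call. Sample syntax (a sketch combining the options above):
+
+[source,text]
+----
+r=knnRegress(obs, quality, 5, manhattan(), robust="true", scale="true"),
+----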
 
 
 == K-Means Clustering
 
 The `kmeans` function performs k-means clustering of the rows of a matrix.