Posted to commits@lucene.apache.org by jb...@apache.org on 2019/06/15 01:13:17 UTC
[lucene-solr] branch SOLR-13105-visual updated: SOLR-13105: Regression visualization WIP
This is an automated email from the ASF dual-hosted git repository.
jbernste pushed a commit to branch SOLR-13105-visual
in repository https://gitbox.apache.org/repos/asf/lucene-solr.git
The following commit(s) were added to refs/heads/SOLR-13105-visual by this push:
new 8f5bec6 SOLR-13105: Regression visualization WIP
8f5bec6 is described below
commit 8f5bec6322e4b1533fab5f74d15ac7e18860b43d
Author: Joel Bernstein <jb...@apache.org>
AuthorDate: Fri Jun 14 21:12:47 2019 -0400
SOLR-13105: Regression visualization WIP
---
.../src/images/math-expressions/diagnostics.png | Bin 0 -> 150747 bytes
.../images/math-expressions/regression-plot.png | Bin 0 -> 229605 bytes
.../src/images/math-expressions/residual-plot.png | Bin 0 -> 283599 bytes
solr/solr-ref-guide/src/math-expressions.adoc | 6 +-
solr/solr-ref-guide/src/regression.adoc | 95 ++++++++++++---------
solr/solr-ref-guide/src/vector-math.adoc | 28 ++++++
6 files changed, 88 insertions(+), 41 deletions(-)
diff --git a/solr/solr-ref-guide/src/images/math-expressions/diagnostics.png b/solr/solr-ref-guide/src/images/math-expressions/diagnostics.png
new file mode 100644
index 0000000..192856e
Binary files /dev/null and b/solr/solr-ref-guide/src/images/math-expressions/diagnostics.png differ
diff --git a/solr/solr-ref-guide/src/images/math-expressions/regression-plot.png b/solr/solr-ref-guide/src/images/math-expressions/regression-plot.png
new file mode 100644
index 0000000..e68a790
Binary files /dev/null and b/solr/solr-ref-guide/src/images/math-expressions/regression-plot.png differ
diff --git a/solr/solr-ref-guide/src/images/math-expressions/residual-plot.png b/solr/solr-ref-guide/src/images/math-expressions/residual-plot.png
new file mode 100644
index 0000000..e0dc360
Binary files /dev/null and b/solr/solr-ref-guide/src/images/math-expressions/residual-plot.png differ
diff --git a/solr/solr-ref-guide/src/math-expressions.adoc b/solr/solr-ref-guide/src/math-expressions.adoc
index 2fbb95c..3cb8eed 100644
--- a/solr/solr-ref-guide/src/math-expressions.adoc
+++ b/solr/solr-ref-guide/src/math-expressions.adoc
@@ -31,11 +31,11 @@ image::images/math-expressions/curve-fitting.png[]
*<<scalar-math.adoc#scalar-math,Scalar Math>>*: The functions for applying math to numbers and visualizing results.
-*<<vector-math.adoc#vector-math,Vector Math>>*: Vector create, math, manipulation and visualization.
+*<<vector-math.adoc#vector-math,Vector Math>>*: Vector creation, math, manipulation and visualization.
*<<variables.adoc#variables,Variables and Caching>>*: Assigning, visualizing and caching variables.
-*<<matrix-math.adoc#matrix-math,Matrix Math>>*: Matrix creation, manipulation, and matrix math.
+*<<matrix-math.adoc#matrix-math,Matrix Math>>*: Matrix creation, math and manipulation.
*<<vectorization.adoc#vectorization,Streams and Vectorization>>*: Retrieving streams and vectorizing numeric and lat/lon location fields.
@@ -55,7 +55,7 @@ image::images/math-expressions/curve-fitting.png[]
*<<curve-fitting.adoc#curve-fitting,Curve Fitting>>*: Polynomial, Harmonic and Gaussian curve fitting.
-*<<time-series.adoc#time-series,Time Series>>*: Aggregation, smoothing and differencing of time series.
+*<<time-series.adoc#time-series,Time Series>>*: Aggregation, smoothing, differencing and anomaly detection of time series.
*<<machine-learning.adoc#machine-learning,Machine Learning>>*: Functions used in machine learning.
diff --git a/solr/solr-ref-guide/src/regression.adoc b/solr/solr-ref-guide/src/regression.adoc
index 4ec23c3..4275b00 100644
--- a/solr/solr-ref-guide/src/regression.adoc
+++ b/solr/solr-ref-guide/src/regression.adoc
@@ -35,10 +35,10 @@ analysis.
[source,text]
----
-let(a=random(collection2, q="*:*", rows="5000", fl="filesize_d, response_d"),
- b=col(a, filesize_d),
- c=col(a, response_d),
- d=regress(b, c))
+let(a=random(testapp, q="*:*", rows="50000", fl="filesize_d, response_d"),
+ x=col(a, filesize_d),
+ y=col(a, response_d),
+ r=regress(x, y))
----
Note that in this regression analysis the value of `RSquared` is `.75`. This means that changes in
@@ -50,29 +50,32 @@ Note that in this regression analysis the value of `RSquared` is `.75`. This mea
"result-set": {
"docs": [
{
- "d": {
- "significance": 0,
- "totalSumSquares": 10564812.895147054,
- "R": 0.8674822407146515,
- "RSquared": 0.7525254379553127,
- "meanSquareError": 523.1137343558588,
- "intercept": -49.528134913099095,
- "slopeConfidenceInterval": 0.0003171801710329995,
- "regressionSumSquares": 7950290.450836472,
- "slope": 0.019945557923159506,
- "interceptStdErr": 6.489732340389941,
- "N": 5000
- }
+ "significance": 0,
+ "totalSumSquares": 96595678.64838874,
+ "R": 0.9052835767815126,
+ "RSquared": 0.8195383543903288,
+ "meanSquareError": 348.6502485633668,
+ "intercept": 55.64040842391729,
+ "slopeConfidenceInterval": 0.0000822026526346821,
+ "regressionSumSquares": 79163863.52071753,
+ "slope": 0.019984612363694493,
+ "interceptStdErr": 1.6792610845256566,
+ "N": 50000
},
{
"EOF": true,
- "RESPONSE_TIME": 98
+ "RESPONSE_TIME": 344
}
]
}
}
----
+The diagnostics can be visualized in a table using Zeppelin-Solr.
+
+image::images/math-expressions/diagnostics.png[]
+
+
=== Prediction
The `predict` function uses the regression model to make predictions.
@@ -84,11 +87,11 @@ the value of `response_d` for the `filesize_d` value of `40000`.
[source,text]
----
-let(a=random(collection2, q="*:*", rows="5000", fl="filesize_d, response_d"),
- b=col(a, filesize_d),
- c=col(a, response_d),
- d=regress(b, c),
- e=predict(d, 40000))
+let(a=random(testapp, q="*:*", rows="5000", fl="filesize_d, response_d"),
+ x=col(a, filesize_d),
+ y=col(a, response_d),
+ r=regress(x, y),
+ p=predict(r, 40000))
----
When this expression is sent to the `/stream` handler it responds with:
@@ -99,7 +102,7 @@ When this expression is sent to the `/stream` handler it responds with:
"result-set": {
"docs": [
{
- "e": 748.079241022975
+ "p": 748.079241022975
},
{
"EOF": true,
@@ -119,11 +122,11 @@ In this case 5000 predictions are returned.
[source,text]
----
-let(a=random(collection2, q="*:*", rows="5000", fl="filesize_d, response_d"),
- b=col(a, filesize_d),
- c=col(a, response_d),
- d=regress(b, c),
- e=predict(d, b))
+let(a=random(testapp, q="*:*", rows="5000", fl="filesize_d, response_d"),
+ x=col(a, filesize_d),
+ y=col(a, response_d),
+ r=regress(x, y),
+ p=predict(r, x))
----
When this expression is sent to the `/stream` handler it responds with:
@@ -134,7 +137,7 @@ When this expression is sent to the `/stream` handler it responds with:
"result-set": {
"docs": [
{
- "e": [
+ "p": [
742.2525322514165,
709.6972488729955,
687.8382568904871,
@@ -158,25 +161,33 @@ When this expression is sent to the `/stream` handler it responds with:
}
----
+=== Regression Plot
+
+Using *zplot* and the Zeppelin-Solr interpreter we can visualize both the observations and the predictions in
+the same scatter plot. In the example below zplot is plotting the filesize_d observations on the
+*x* axis, the response_d observations on the *y* axis and the predictions on the *y1* axis.
+
+image::images/math-expressions/regression-plot.png[]
+
=== Residuals
The difference between the observed value and the predicted value is known as the
residual. There isn't a specific function to calculate the residuals but vector
math can be used to perform the calculation.
-In the example below the predictions are stored in variable *`e`*. The `ebeSubtract`
+In the example below the predictions are stored in variable *`p`*. The `ebeSubtract`
function is then used to subtract the predictions
-from the actual `response_d` values stored in variable *`c`*. Variable *`f`* contains
+from the actual `response_d` values stored in variable *`y`*. Variable *`e`* contains
the array of residuals.
[source,text]
----
-let(a=random(collection2, q="*:*", rows="5000", fl="filesize_d, response_d"),
- b=col(a, filesize_d),
- c=col(a, response_d),
- d=regress(b, c),
- e=predict(d, b),
- f=ebeSubtract(c, e))
+let(a=random(testapp, q="*:*", rows="500", fl="filesize_d, response_d"),
+ x=col(a, filesize_d),
+ y=col(a, response_d),
+ r=regress(x, y),
+ p=predict(r, x),
+ e=ebeSubtract(y, p))
----
When this expression is sent to the `/stream` handler it responds with:
@@ -213,6 +224,14 @@ When this expression is sent to the `/stream` handler it responds with:
}
----
+=== Residual Plot
+
+Using *zplot* and Zeppelin-Solr we can visualize the residuals with
+a residuals plot.
+
+image::images/math-expressions/residual-plot.png[]
+
+
== Multivariate Linear Regression
The `olsRegress` function performs a multivariate linear regression analysis. Multivariate linear
diff --git a/solr/solr-ref-guide/src/vector-math.adoc b/solr/solr-ref-guide/src/vector-math.adoc
index bfa6c91..3f3abcd 100644
--- a/solr/solr-ref-guide/src/vector-math.adoc
+++ b/solr/solr-ref-guide/src/vector-math.adoc
@@ -277,6 +277,34 @@ When this expression is sent to the `/stream` handler it responds with:
}
----
+== Getting Values By Index
+
+Values from a vector can be retrieved by index with the *valueAt* function.
+
+[source,text]
+----
+valueAt(array(0,1,2,3,4,5,6), 2)
+----
+
+When this expression is sent to the `/stream` handler it responds with:
+
+[source,json]
+----
+{
+ "result-set": {
+ "docs": [
+ {
+ "return-value": 2
+ },
+ {
+ "EOF": true,
+ "RESPONSE_TIME": 0
+ }
+ ]
+ }
+}
+----
+
== Sequences
The *sequence* function can be used to generate a sequence of numbers as an array.