Posted to commits@lucene.apache.org by jb...@apache.org on 2019/06/15 01:13:17 UTC

[lucene-solr] branch SOLR-13105-visual updated: SOLR-13105: Regression visualization WIP

This is an automated email from the ASF dual-hosted git repository.

jbernste pushed a commit to branch SOLR-13105-visual
in repository https://gitbox.apache.org/repos/asf/lucene-solr.git


The following commit(s) were added to refs/heads/SOLR-13105-visual by this push:
     new 8f5bec6  SOLR-13105: Regression visualization WIP
8f5bec6 is described below

commit 8f5bec6322e4b1533fab5f74d15ac7e18860b43d
Author: Joel Bernstein <jb...@apache.org>
AuthorDate: Fri Jun 14 21:12:47 2019 -0400

    SOLR-13105: Regression visualization WIP
---
 .../src/images/math-expressions/diagnostics.png    | Bin 0 -> 150747 bytes
 .../images/math-expressions/regression-plot.png    | Bin 0 -> 229605 bytes
 .../src/images/math-expressions/residual-plot.png  | Bin 0 -> 283599 bytes
 solr/solr-ref-guide/src/math-expressions.adoc      |   6 +-
 solr/solr-ref-guide/src/regression.adoc            |  95 ++++++++++++---------
 solr/solr-ref-guide/src/vector-math.adoc           |  28 ++++++
 6 files changed, 88 insertions(+), 41 deletions(-)

diff --git a/solr/solr-ref-guide/src/images/math-expressions/diagnostics.png b/solr/solr-ref-guide/src/images/math-expressions/diagnostics.png
new file mode 100644
index 0000000..192856e
Binary files /dev/null and b/solr/solr-ref-guide/src/images/math-expressions/diagnostics.png differ
diff --git a/solr/solr-ref-guide/src/images/math-expressions/regression-plot.png b/solr/solr-ref-guide/src/images/math-expressions/regression-plot.png
new file mode 100644
index 0000000..e68a790
Binary files /dev/null and b/solr/solr-ref-guide/src/images/math-expressions/regression-plot.png differ
diff --git a/solr/solr-ref-guide/src/images/math-expressions/residual-plot.png b/solr/solr-ref-guide/src/images/math-expressions/residual-plot.png
new file mode 100644
index 0000000..e0dc360
Binary files /dev/null and b/solr/solr-ref-guide/src/images/math-expressions/residual-plot.png differ
diff --git a/solr/solr-ref-guide/src/math-expressions.adoc b/solr/solr-ref-guide/src/math-expressions.adoc
index 2fbb95c..3cb8eed 100644
--- a/solr/solr-ref-guide/src/math-expressions.adoc
+++ b/solr/solr-ref-guide/src/math-expressions.adoc
@@ -31,11 +31,11 @@ image::images/math-expressions/curve-fitting.png[]
 
 *<<scalar-math.adoc#scalar-math,Scalar Math>>*: The functions for applying math to numbers and visualizing results.
 
-*<<vector-math.adoc#vector-math,Vector Math>>*: Vector create, math, manipulation and visualization.
+*<<vector-math.adoc#vector-math,Vector Math>>*: Vector creation, math, manipulation and visualization.
 
 *<<variables.adoc#variables,Variables and Caching>>*: Assigning, visualizing and caching variables.
 
-*<<matrix-math.adoc#matrix-math,Matrix Math>>*: Matrix creation, manipulation, and matrix math.
+*<<matrix-math.adoc#matrix-math,Matrix Math>>*: Matrix creation, math and manipulation.
 
 *<<vectorization.adoc#vectorization,Streams and Vectorization>>*: Retrieving streams and vectorizing numeric and lat/lon location fields.
 
@@ -55,7 +55,7 @@ image::images/math-expressions/curve-fitting.png[]
 
 *<<curve-fitting.adoc#curve-fitting,Curve Fitting>>*: Polynomial, Harmonic and Gaussian curve fitting.
 
-*<<time-series.adoc#time-series,Time Series>>*: Aggregation, smoothing and differencing of time series.
+*<<time-series.adoc#time-series,Time Series>>*: Aggregation, smoothing, differencing and anomaly detection of time series.
 
 *<<machine-learning.adoc#machine-learning,Machine Learning>>*: Functions used in machine learning.
 
diff --git a/solr/solr-ref-guide/src/regression.adoc b/solr/solr-ref-guide/src/regression.adoc
index 4ec23c3..4275b00 100644
--- a/solr/solr-ref-guide/src/regression.adoc
+++ b/solr/solr-ref-guide/src/regression.adoc
@@ -35,10 +35,10 @@ analysis.
 
 [source,text]
 ----
-let(a=random(collection2, q="*:*", rows="5000", fl="filesize_d, response_d"),
-    b=col(a, filesize_d),
-    c=col(a, response_d),
-    d=regress(b, c))
+let(a=random(testapp, q="*:*", rows="50000", fl="filesize_d, response_d"),
+    x=col(a, filesize_d),
+    y=col(a, response_d),
+    r=regress(x, y))
 ----
 
 Note that in this regression analysis the value of `RSquared` is `.75`. This means that changes in
@@ -50,29 +50,32 @@ Note that in this regression analysis the value of `RSquared` is `.75`. This mea
   "result-set": {
     "docs": [
       {
-        "d": {
-          "significance": 0,
-          "totalSumSquares": 10564812.895147054,
-          "R": 0.8674822407146515,
-          "RSquared": 0.7525254379553127,
-          "meanSquareError": 523.1137343558588,
-          "intercept": -49.528134913099095,
-          "slopeConfidenceInterval": 0.0003171801710329995,
-          "regressionSumSquares": 7950290.450836472,
-          "slope": 0.019945557923159506,
-          "interceptStdErr": 6.489732340389941,
-          "N": 5000
-        }
+        "significance": 0,
+        "totalSumSquares": 96595678.64838874,
+        "R": 0.9052835767815126,
+        "RSquared": 0.8195383543903288,
+        "meanSquareError": 348.6502485633668,
+        "intercept": 55.64040842391729,
+        "slopeConfidenceInterval": 0.0000822026526346821,
+        "regressionSumSquares": 79163863.52071753,
+        "slope": 0.019984612363694493,
+        "interceptStdErr": 1.6792610845256566,
+        "N": 50000
       },
       {
         "EOF": true,
-        "RESPONSE_TIME": 98
+        "RESPONSE_TIME": 344
       }
     ]
   }
 }
 ----
 
+The diagnostics can be visualized in a table using Zeppelin-Solr.
+
+image::images/math-expressions/diagnostics.png[]
+
+
 === Prediction
 
 The `predict` function uses the regression model to make predictions.
@@ -84,11 +87,11 @@ the value of `response_d` for the `filesize_d` value of `40000`.
 
 [source,text]
 ----
-let(a=random(collection2, q="*:*", rows="5000", fl="filesize_d, response_d"),
-    b=col(a, filesize_d),
-    c=col(a, response_d),
-    d=regress(b, c),
-    e=predict(d, 40000))
+let(a=random(testapp, q="*:*", rows="5000", fl="filesize_d, response_d"),
+    x=col(a, filesize_d),
+    y=col(a, response_d),
+    r=regress(x, y),
+    p=predict(r, 40000))
 ----
 
 When this expression is sent to the `/stream` handler it responds with:
@@ -99,7 +102,7 @@ When this expression is sent to the `/stream` handler it responds with:
   "result-set": {
     "docs": [
       {
-        "e": 748.079241022975
+        "p": 748.079241022975
       },
       {
         "EOF": true,
@@ -119,11 +122,11 @@ In this case 5000 predictions are returned.
 
 [source,text]
 ----
-let(a=random(collection2, q="*:*", rows="5000", fl="filesize_d, response_d"),
-    b=col(a, filesize_d),
-    c=col(a, response_d),
-    d=regress(b, c),
-    e=predict(d, b))
+let(a=random(testapp, q="*:*", rows="5000", fl="filesize_d, response_d"),
+    x=col(a, filesize_d),
+    y=col(a, response_d),
+    r=regress(x, y),
+    p=predict(r, x))
 ----
 
 When this expression is sent to the `/stream` handler it responds with:
@@ -134,7 +137,7 @@ When this expression is sent to the `/stream` handler it responds with:
   "result-set": {
     "docs": [
       {
-        "e": [
+        "p": [
           742.2525322514165,
           709.6972488729955,
           687.8382568904871,
@@ -158,25 +161,33 @@ When this expression is sent to the `/stream` handler it responds with:
 }
 ----
 
+=== Regression Plot
+
+Using *zplot* and the Zeppelin-Solr interpreter we can visualize both the observations and the predictions in
+the same scatter plot. In the example below zplot is plotting the filesize_d observations on the
+*x* axis, the response_d observations on the *y* axis and the predictions on the *y1* axis.
+
+image::images/math-expressions/regression-plot.png[]
+
 === Residuals
 
 The difference between the observed value and the predicted value is known as the
 residual. There isn't a specific function to calculate the residuals but vector
 math can used to perform the calculation.
 
-In the example below the predictions are stored in variable *`e`*. The `ebeSubtract`
+In the example below the predictions are stored in variable *`p`*. The `ebeSubtract`
 function is then used to subtract the predictions
-from the actual `response_d` values stored in variable *`c`*. Variable *`f`* contains
+from the actual `response_d` values stored in variable *`y`*. Variable *`e`* contains
 the array of residuals.
 
 [source,text]
 ----
-let(a=random(collection2, q="*:*", rows="5000", fl="filesize_d, response_d"),
-    b=col(a, filesize_d),
-    c=col(a, response_d),
-    d=regress(b, c),
-    e=predict(d, b),
-    f=ebeSubtract(c, e))
+let(a=random(testapp, q="*:*", rows="500", fl="filesize_d, response_d"),
+    x=col(a, filesize_d),
+    y=col(a, response_d),
+    r=regress(x, y),
+    p=predict(r, x),
+    e=ebeSubtract(y, p))
 ----
 
 When this expression is sent to the `/stream` handler it responds with:
@@ -213,6 +224,14 @@ When this expression is sent to the `/stream` handler it responds with:
 }
 ----
 
+=== Residual Plot
+
+Using *zplot* and Zeppelin-Solr we can visualize the residuals with
+a residual plot.
+
+image::images/math-expressions/residual-plot.png[]
+
+
 == Multivariate Linear Regression
 
 The `olsRegress` function performs a multivariate linear regression analysis. Multivariate linear
diff --git a/solr/solr-ref-guide/src/vector-math.adoc b/solr/solr-ref-guide/src/vector-math.adoc
index bfa6c91..3f3abcd 100644
--- a/solr/solr-ref-guide/src/vector-math.adoc
+++ b/solr/solr-ref-guide/src/vector-math.adoc
@@ -277,6 +277,34 @@ When this expression is sent to the `/stream` handler it responds with:
 }
 ----
 
+== Getting Values By Index
+
+Values from a vector can be retrieved by index with the *valueAt* function.
+
+[source,text]
+----
+valueAt(array(0,1,2,3,4,5,6), 2)
+----
+
+When this expression is sent to the `/stream` handler it responds with:
+
+[source,json]
+----
+{
+  "result-set": {
+    "docs": [
+      {
+        "return-value": 2
+      },
+      {
+        "EOF": true,
+        "RESPONSE_TIME": 0
+      }
+    ]
+  }
+}
+----
+
 == Sequences
 
 The *sequence* function can be used to generate a sequence of numbers as an array.